In [135]:
from sklearn.metrics import silhouette_samples, silhouette_score
from pandas.plotting import scatter_matrix
from IPython.display import Image
from wordcloud import WordCloud, STOPWORDS

import matplotlib.pyplot as plt
import seaborn as sns
import re
#from mpl_toolkits.basemap import Basemap
import pandas as pd
import numpy as np
import math


from IPython.display import Markdown

from collections import Counter
from scipy.cluster.hierarchy import dendrogram, linkage  
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

from sklearn.manifold import TSNE
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import KFold

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

Business Understanding

It is no secret that the objective of every business is to increase their size whether it be sales or market share which is indicative of growth. Only recently did we hear that Apple has hit a company valuation of one trillion Dollars. One colossal driver for that success is the behind-the-scenes market analytics that researches the consumers willing to spend thousands of dollars on tech products. This is why consumer data is probably one of the most sought after and inaccessible type of data category in the world.

Insights into your customers = Actional intel = $

We were lucky to find a dataset which contained two years’ worth of transactions for a UK based online retail store. The company sells “unique all-occasion gifts.” The dataset, “E-Commerce Data,” was downloaded from Kaggle: https://www.kaggle.com/carrie1/ecommerce-data

For the last 50 years, one extremely powerful method of understanding a business’s market or customer base is through market segmentation. Companies that try to market products to the masses or approach their marketing through a “one-size-fits-all” methodology will eventually fail. Market segmentation divides the customer base into groups of individuals that are similar in specific ways. The following tables illustrates the core ways of segmenting customers:

In [136]:
display(Image('https://s3-eu-west-1.amazonaws.com/tutor2u-media/subjects/business/diagrams/marketing-segmentation-bases1.jpg', width=700, unconfined=True))
display(Image('https://s3-eu-west-1.amazonaws.com/tutor2u-media/subjects/business/diagrams/marketing-segments-intro.jpg', width=700, unconfined=True))

The benefits of this practice is invaluable to the business and include:

  1. Marketing efficiency: 80/20 rule focus on the customers that add more value to the business.
  2. Identifying and determining new market opportunities: Through clustering, we can possibly identify missed opportunities.
  3. Improved distribution strategies: Clustering using geographical data allows us to funnel more resources towards those areas.
  4. Customer Loyalty: Clustering may possibly identify behavioral and psychological behaviors that drives customer retention towards the business. In this project, we will lead an investigation using clustering-based methodology to identify and then improve market segmentation for the UK business. This chosen validation method is proven and practiced all over the world. The most prominent segmentation will be either customer sales which is indicative of size or value to the business or customer geography.

We will follow the protocols outlined in the BLT section 10.8 – Clustering Evaluation to measure the effectiveness of our algorithms/methods. This includes using cluster validity estimation which a measure of goodness of created clusters. Although such measure is inherently subjective, we can still determine clustering tendency of the data set. We can also compare the results of two different sets of cluster analysis. We will also do a visual inspection and determine a similarity matrix. Finally, we will be examining the silhouette score as well.


As discussed previously, our chosen validation methods have been suggested by professor Larson and are generally accepted as industry standard. We’ve also determined that for the stakeholder, if the clustering conclusions resulted in actional insights that the business can capitalize on, then we have met their needs.

Table Sources: https://www.tutor2u.net/business/reference/market-segmentation

Data Understanding

Data Exploration:

There are 541,909 records in our online marketing data research. This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

Following are the factor names and descriptions:

Factor Name

Type

Description

InvoiceNo

Nominal

A 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation

StockCode

Nominal

A 5-digit integral number uniquely assigned to each distinct product.

Description

Nominal

Product (item) name.

Quantity

Numeric

The quantities of each product (item) per transaction.

InvoiceDate

Numeric

The day and time when each transaction was generated

UnitPrice

Numeric

Product price per unit in sterling.

CustomerID

Nominal

A 5-digit integral number uniquely assigned to each customer.

Country

Nominal

The name of the country where each customer resides


In [137]:
# Import Online marketing data
marketing_data = pd.read_excel('./Data/Online Retail.xlsx')
In [138]:
marketing_data.head()
Out[138]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
In [139]:
#Data types information of each attributes:  
marketing_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null datetime64[ns]
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB
In [140]:
# Find the number of missing values in each column:
print(marketing_data.isnull().sum(axis=0))

na_cols = marketing_data.loc[:,marketing_data.isnull().mean()>.0].columns

for col in na_cols:
    print(col, ' column is ',round(marketing_data[col].isnull().mean(),5)*100,'% null')
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
Description  column is  0.268 % null
CustomerID  column is  24.927 % null
In [141]:
plt.figure(figsize=(16,10))
plt.xlim(0,1)
plt.suptitle('Percentage of null values per column', size=20)
marketing_data.isnull().mean().plot.barh();
plt.show()

When looking at the number of missing values in each attribute, we found there are 135,080 records, almost 25% of our data set, where CustomerID was null. Part of our research is based on identifying high value customers. We looked at creating a unique CustomerID for each country where Customer ID is null. This created a highly valuable single customer, which most likely does not exist. We decided to remove the records where CustomerID was null for this part of the research. However, part of our research also looks to develop Association Rules. This part of the research benefits from retaining as much data as possible. Our final decision was to use a data set with CustomerID removed where CustomerID was an important feature, and use a second data set retaining all CustomerID's that are null and assigning a unique number for each country.

In [142]:
# Make the duplicate of the data set
marketing_data_copy = marketing_data.copy()

# Remove the rows with missing in CustomerID column:
marketing_data.dropna(subset=['CustomerID'], inplace=True)
In [143]:
# Find the first and last orders in our dataset:
marketing_data['InvoiceDate'].min()
Out[143]:
Timestamp('2010-12-01 08:26:00')
In [144]:
marketing_data['InvoiceDate'].max()
Out[144]:
Timestamp('2011-12-09 12:50:00')

This dataset includes orders done between Dec 2010 and Dec 2011.

In [145]:
marketing_data.shape
Out[145]:
(406829, 8)
In [146]:
# Now we have both datasets with no null values
marketing_data.isnull().sum(axis=0)
Out[146]:
InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

Now, checking to identify duplicates in our dataset, we found 5,225 duplicate rows and removed them.

In [147]:
# Find and remove the duplicate records
display('Duplicated records: {}'.format(marketing_data.duplicated().sum()))
marketing_data.drop_duplicates(inplace = True)
'Duplicated records: 5225'
In [148]:
marketing_data.shape
Out[148]:
(401604, 8)

We are replacing all descriptions that don't have any letters or are not strings.

In [149]:
for ind,el in enumerate(marketing_data['Description']):
    if type(el) != str:
        marketing_data.loc['Description',ind]='Unknown'
    elif re.match('.*[a-zA-Z]+.*',el) is not None:
        pass
    else:
        marketing_data.loc[ind,'Description']='Unknown'

$\textbf{New Features}$
Our research required we created new features to helps us better understand the data perform some the analysis we were interested in. These are the new features:

Feature

Description

TotalPrice

It is the total price of each items (Quantity * Unit Price)

Recency

It is the number of days since last purchase for each invoice and customer

Frequency

Total number of purchases for each customer (it will be imputed in groupby customer section)

Monetary

Total money that each customer spent (it will be imputed in groupby customer section)

In [150]:
# Create the TotalPrice feature
marketing_data['TotalPrice'] = marketing_data['Quantity']*marketing_data['UnitPrice']
In [151]:
# Create Recency Feature for each invoice:
Current=marketing_data['InvoiceDate'].max()
marketing_data['Recency']=(Current-marketing_data['InvoiceDate']).astype('timedelta64[D]')
In [152]:
marketing_data['Recency']=marketing_data['Recency'].astype(float)

Here the Recency column has timedelta64[ns] data type. For easier use in our analysis, we extract the days in integer format, representing the number of days since the maximum date in our dataset.

In [153]:
marketing_data.head()
Out[153]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country TotalPrice Recency
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 15.30 373.0
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34 373.0
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 22.00 373.0
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34 373.0
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34 373.0
In [154]:
marketing_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 401604 entries, 0 to 541908
Data columns (total 10 columns):
InvoiceNo      401604 non-null object
StockCode      401604 non-null object
Description    401604 non-null object
Quantity       401604 non-null int64
InvoiceDate    401604 non-null datetime64[ns]
UnitPrice      401604 non-null float64
CustomerID     401604 non-null float64
Country        401604 non-null object
TotalPrice     401604 non-null float64
Recency        401604 non-null float64
dtypes: datetime64[ns](1), float64(4), int64(1), object(4)
memory usage: 33.7+ MB

$\textbf{Examine Data Grouped by Country}$
Create a data frame that shows the number of unique feautures in each country. We are looking to see if there is any interesting data that shows up based on which countries orders come from. We want to see which specific countries buy the most items at one time, buys the most expensive items, and spends the most money on any single item. This information may be actionable by the business.

In [155]:
per_country = marketing_data.groupby(['Country']).nunique()
per_country= per_country.drop('Country',axis=1)
per_country.reset_index(level=0, inplace=True)
per_country[['Country','InvoiceNo','StockCode','Description','Quantity','InvoiceDate','UnitPrice','CustomerID']]
Out[155]:
Country InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID
0 Australia 69 600 609 67 66 77 9
1 Austria 19 307 307 23 19 44 11
2 Bahrain 2 16 16 8 2 8 2
3 Belgium 119 778 785 38 119 66 25
4 Brazil 1 32 32 7 1 16 1
5 Canada 6 147 147 19 6 30 4
6 Channel Islands 33 430 436 25 33 55 9
7 Cyprus 20 498 498 29 20 56 8
8 Czech Republic 5 25 25 12 5 17 1
9 Denmark 21 251 252 35 21 44 9
10 EIRE 319 1950 1997 88 318 115 3
11 European Community 5 50 50 9 5 20 1
12 Finland 48 458 465 30 48 53 12
13 France 458 1523 1545 74 458 97 87
14 Germany 603 1671 1703 68 598 106 95
15 Greece 6 138 138 16 6 34 4
16 Iceland 7 103 103 15 7 32 1
17 Israel 6 221 221 24 6 51 4
18 Italy 55 477 480 30 55 62 15
19 Japan 28 216 217 47 29 48 8
20 Lebanon 1 45 45 9 1 23 1
21 Lithuania 4 29 29 10 4 10 1
22 Malta 10 99 99 18 10 29 2
23 Netherlands 101 785 805 66 101 84 9
24 Norway 40 591 595 37 40 77 10
25 Poland 24 204 205 24 24 48 6
26 Portugal 70 686 694 32 70 65 19
27 RSA 1 58 58 9 1 23 1
28 Saudi Arabia 2 9 9 4 2 5 1
29 Singapore 10 178 179 24 10 51 1
30 Spain 105 1093 1107 49 105 81 31
31 Sweden 46 261 261 48 45 45 8
32 Switzerland 71 947 954 45 70 69 21
33 USA 7 163 163 32 7 33 4
34 United Arab Emirates 3 68 68 11 3 25 2
35 United Kingdom 19857 3661 3860 426 18441 569 3950
36 Unspecified 8 213 213 14 8 44 4

The chart below shows the number of unique customer ID's in each country.

In [156]:
per_country.sort_values(by=['CustomerID'],ascending=False,inplace=True)
num_cust = per_country['CustomerID']
country = per_country['Country']
plt.figure(figsize=(16,10))
#plt.barh(country, num_cust,)
#plt.xticks(y_pos, country )
#ax = sns.barplot(x=num_cust,y=country,orient='h')
#ax.set(xlabel ='Number of Customer', ylabel='Country')
plt.title('Number of Customers per Country')
plt.xlabel('Number of Customers')
#ax.fig.set_size_inches(15,15)
sns.barplot(x=num_cust,y=country,orient='h')
plt.show()
display(per_country[['CustomerID','Country']])
CustomerID Country
35 3950 United Kingdom
14 95 Germany
13 87 France
30 31 Spain
3 25 Belgium
32 21 Switzerland
26 19 Portugal
18 15 Italy
12 12 Finland
1 11 Austria
24 10 Norway
23 9 Netherlands
0 9 Australia
9 9 Denmark
6 9 Channel Islands
7 8 Cyprus
31 8 Sweden
19 8 Japan
25 6 Poland
33 4 USA
5 4 Canada
36 4 Unspecified
17 4 Israel
15 4 Greece
10 3 EIRE
22 2 Malta
34 2 United Arab Emirates
2 2 Bahrain
8 1 Czech Republic
21 1 Lithuania
20 1 Lebanon
27 1 RSA
28 1 Saudi Arabia
29 1 Singapore
16 1 Iceland
4 1 Brazil
11 1 European Community

The United Kingdom dominates our customer list. In fact over 90% of customers are from the United Kingdom

In [157]:
cust_UK_row = per_country.loc[per_country['Country'] =='United Kingdom']
cust_UK = cust_UK_row['CustomerID'].sum()
cust_ROW_row = per_country.loc[per_country['Country']!='United Kingdom']
cust_ROW = cust_ROW_row['CustomerID'].sum()

display('{0:.2%} of customers are in the United Kingdom'.format((cust_UK/(cust_UK+cust_ROW))))
labels = 'UK Customers','ROW Customers'
cust_count = [cust_UK, cust_ROW]
colors = ['#ff9999','#66b3ff']
explode = [0,0.2]
plt.axis('equal')   
plt.pie(cust_count, labels=labels, explode=explode, colors=colors, autopct='%1.2f%%', shadow=True, startangle=0)
plt.show()
'90.18% of customers are in the United Kingdom'
In [158]:
count_by_country = marketing_data.groupby(['Country']).count()
count_by_country['Avg_UnitPrice'] = marketing_data.groupby(['Country'])['UnitPrice'].mean()
count_by_country['Avg_Quantity'] = marketing_data.groupby(['Country'])['Quantity'].mean()
count_by_country['Avg_TotalPrice'] = marketing_data.groupby(['Country'])['TotalPrice'].mean()
count_by_country.reset_index(level=0, inplace=True)
count_by_country.sort_values(by=['CustomerID'],ascending=False,inplace=True)

This chart below is showing us which country is buying the most of any single item.

In [159]:
count_by_country.sort_values(by=['Avg_Quantity'],ascending=False,inplace=True)
avg_quantity = count_by_country['Avg_Quantity']

country = count_by_country['Country']
plt.figure(figsize=(16,10))
plt.title('Average Items per Line Item by Country')
plt.xlabel('Number of Customers')
#ax.fig.set_size_inches(15,15)
sns.barplot(x=avg_quantity,y=country,orient='h')
plt.show()
display(count_by_country[['Avg_Quantity', 'Country']])
Avg_Quantity Country
23 84.406580 Netherlands
31 77.292842 Sweden
19 70.441341 Japan
0 66.488871 Australia
29 22.855895 Singapore
9 21.048843 Denmark
8 19.733333 Czech Republic
21 18.628571 Lithuania
5 18.298013 Canada
10 18.218997 EIRE
24 17.722836 Norway
17 16.141700 Israel
32 15.864678 Switzerland
12 15.346763 Finland
2 15.294118 Bahrain
34 14.441176 United Arab Emirates
16 13.505495 Iceland
13 12.956460 France
6 12.513871 Channel Islands
14 12.377743 Germany
1 12.037406 Austria
35 11.198644 United Kingdom
3 11.189947 Belgium
4 11.125000 Brazil
26 10.888511 Portugal
25 10.712610 Poland
15 10.657534 Greece
30 10.607991 Spain
7 10.304419 Cyprus
18 9.961395 Italy
20 8.577778 Lebanon
11 8.147541 European Community
28 7.500000 Saudi Arabia
22 7.433071 Malta
36 7.406639 Unspecified
27 6.068966 RSA
33 3.553265 USA

The chart below is showing us which country is buying the most expensive items.

In [160]:
count_by_country.sort_values(by=['Avg_UnitPrice'],ascending=False,inplace=True)
avg_price = count_by_country['Avg_UnitPrice']

country = count_by_country['Country']
plt.figure(figsize=(16,10))
plt.title('Average Unit Price per Line Item by Country')
plt.xlabel('Number of Customers')
total = float(len(count_by_country))
#ax.fig.set_size_inches(16,10)
sns.barplot(x=avg_price,y=country,orient='h')
plt.show()
display(count_by_country[['Avg_UnitPrice', 'Country']])
Avg_UnitPrice Country
29 109.645808 Singapore
26 8.771754 Portugal
7 6.350311 Cyprus
5 6.030331 Canada
24 6.012026 Norway
12 5.448705 Finland
20 5.387556 Lebanon
22 5.244173 Malta
10 5.111607 EIRE
13 5.053124 France
30 4.992682 Spain
6 4.936460 Channel Islands
15 4.885548 Greece
18 4.831121 Italy
11 4.820492 European Community
2 4.644118 Bahrain
4 4.456250 Brazil
27 4.277586 RSA
1 4.243192 Austria
25 4.170880 Poland
14 3.969772 Germany
31 3.914816 Sweden
17 3.670648 Israel
3 3.644335 Belgium
32 3.499521 Switzerland
34 3.380735 United Arab Emirates
35 3.268255 United Kingdom
9 3.256941 Denmark
36 3.219710 Unspecified
0 3.217806 Australia
8 2.938333 Czech Republic
21 2.841143 Lithuania
23 2.738317 Netherlands
16 2.644011 Iceland
28 2.411000 Saudi Arabia
19 2.276145 Japan
33 2.216426 USA

Singapore seems like it is an outlier in this area, buying a single item of a much higher price than any other country. The table below explains this. There were four manual charges to what looks like 4 consecutive customers for some high-priced item, and then the invoices were cancelled. Since we decided earlier to leave cancelled items in the data set we will also leave this here and just note this outlier.

In [161]:
display (marketing_data.loc[(marketing_data['Country']=='Singapore')])
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country TotalPrice Recency
70758 542102 21519 GIN & TONIC DIET GREETING CARD 72 2011-01-25 13:26:00 0.36 12744.0 Singapore 25.92 317.0
70759 542102 22697 GREEN REGENCY TEACUP AND SAUCER 6 2011-01-25 13:26:00 2.95 12744.0 Singapore 17.70 317.0
70760 542102 22699 ROSES REGENCY TEACUP AND SAUCER 6 2011-01-25 13:26:00 2.95 12744.0 Singapore 17.70 317.0
70761 542102 22343 PARTY PIZZA DISH RED RETROSPOT 24 2011-01-25 13:26:00 0.21 12744.0 Singapore 5.04 317.0
70762 542102 22344 PARTY PIZZA DISH PINK POLKADOT 24 2011-01-25 13:26:00 0.21 12744.0 Singapore 5.04 317.0
70763 542102 22345 PARTY PIZZA DISH BLUE POLKADOT 24 2011-01-25 13:26:00 0.21 12744.0 Singapore 5.04 317.0
70764 542102 22346 PARTY PIZZA DISH GREEN POLKADOT 24 2011-01-25 13:26:00 0.21 12744.0 Singapore 5.04 317.0
70765 542102 22960 JAM MAKING SET WITH JARS 12 2011-01-25 13:26:00 3.75 12744.0 Singapore 45.00 317.0
70766 542102 22969 HOMEMADE JAM SCENTED CANDLES 24 2011-01-25 13:26:00 1.45 12744.0 Singapore 34.80 317.0
70767 542102 20971 PINK BLUE FELT CRAFT TRINKET BOX 48 2011-01-25 13:26:00 1.25 12744.0 Singapore 60.00 317.0
70768 542102 20972 PINK CREAM FELT CRAFT TRINKET BOX 48 2011-01-25 13:26:00 1.25 12744.0 Singapore 60.00 317.0
70769 542102 21755 LOVE BUILDING BLOCK WORD 18 2011-01-25 13:26:00 5.45 12744.0 Singapore 98.10 317.0
70770 542102 21756 BATH BUILDING BLOCK WORD 6 2011-01-25 13:26:00 5.95 12744.0 Singapore 35.70 317.0
70771 542102 21465 PINK FLOWER CROCHET FOOD COVER 6 2011-01-25 13:26:00 3.75 12744.0 Singapore 22.50 317.0
70772 542102 21471 STRAWBERRY RAFFIA FOOD COVER 6 2011-01-25 13:26:00 3.75 12744.0 Singapore 22.50 317.0
70773 542102 21469 POLKA DOT RAFFIA FOOD COVER 6 2011-01-25 13:26:00 3.75 12744.0 Singapore 22.50 317.0
70774 542102 22348 TEA BAG PLATE RED RETROSPOT 24 2011-01-25 13:26:00 0.85 12744.0 Singapore 20.40 317.0
70775 542102 21791 VINTAGE HEADS AND TAILS CARD GAME 12 2011-01-25 13:26:00 1.25 12744.0 Singapore 15.00 317.0
70776 542102 21790 VINTAGE SNAP CARDS 72 2011-01-25 13:26:00 0.72 12744.0 Singapore 51.84 317.0
70777 542102 21156 RETROSPOT CHILDRENS APRON 16 2011-01-25 13:26:00 1.95 12744.0 Singapore 31.20 317.0
70778 542102 22367 CHILDRENS APRON SPACEBOY DESIGN 16 2011-01-25 13:26:00 1.95 12744.0 Singapore 31.20 317.0
70779 542102 22938 CUPCAKE LACE PAPER SET 6 24 2011-01-25 13:26:00 1.95 12744.0 Singapore 46.80 317.0
70780 542102 22197 SMALL POPCORN HOLDER 100 2011-01-25 13:26:00 0.72 12744.0 Singapore 72.00 317.0
70781 542102 20685 DOORMAT RED RETROSPOT 10 2011-01-25 13:26:00 6.75 12744.0 Singapore 67.50 317.0
70782 542102 21936 RED RETROSPOT PICNIC BAG 25 2011-01-25 13:26:00 2.95 12744.0 Singapore 73.75 317.0
70783 542102 21094 SET/6 RED SPOTTY PAPER PLATES 36 2011-01-25 13:26:00 0.85 12744.0 Singapore 30.60 317.0
70784 542102 48138 DOORMAT UNION FLAG 10 2011-01-25 13:26:00 6.75 12744.0 Singapore 67.50 317.0
70785 542102 22203 MILK PAN RED RETROSPOT 4 2011-01-25 13:26:00 3.75 12744.0 Singapore 15.00 317.0
70786 542102 22202 MILK PAN PINK POLKADOT 4 2011-01-25 13:26:00 3.75 12744.0 Singapore 15.00 317.0
70787 542102 21977 PACK OF 60 PINK PAISLEY CAKE CASES 24 2011-01-25 13:26:00 0.55 12744.0 Singapore 13.20 317.0
... ... ... ... ... ... ... ... ... ... ...
398345 571239 22197 POPCORN HOLDER 100 2011-10-14 14:54:00 0.72 12744.0 Singapore 72.00 55.0
398346 571239 20682 RED RETROSPOT CHILDRENS UMBRELLA 12 2011-10-14 14:54:00 3.25 12744.0 Singapore 39.00 55.0
398347 571239 22929 SCHOOL DESK AND CHAIR 1 2011-10-14 14:54:00 65.00 12744.0 Singapore 65.00 55.0
398348 571239 20685 DOORMAT RED RETROSPOT 10 2011-10-14 14:54:00 7.08 12744.0 Singapore 70.80 55.0
398349 571239 23284 DOORMAT KEEP CALM AND COME IN 10 2011-10-14 14:54:00 7.08 12744.0 Singapore 70.80 55.0
398350 571239 48138 DOORMAT UNION FLAG 10 2011-10-14 14:54:00 7.08 12744.0 Singapore 70.80 55.0
398351 571239 22689 DOORMAT MERRY CHRISTMAS RED 10 2011-10-14 14:54:00 7.08 12744.0 Singapore 70.80 55.0
398352 571239 23311 VINTAGE CHRISTMAS STOCKING 6 2011-10-14 14:54:00 2.55 12744.0 Singapore 15.30 55.0
398353 571239 21181 PLEASE ONE PERSON METAL SIGN 24 2011-10-14 14:54:00 2.10 12744.0 Singapore 50.40 55.0
398354 571239 21175 GIN + TONIC DIET METAL SIGN 12 2011-10-14 14:54:00 2.55 12744.0 Singapore 30.60 55.0
398355 571239 22413 METAL SIGN TAKE IT OR LEAVE IT 6 2011-10-14 14:54:00 2.95 12744.0 Singapore 17.70 55.0
398356 571239 85150 LADIES & GENTLEMEN METAL SIGN 12 2011-10-14 14:54:00 2.55 12744.0 Singapore 30.60 55.0
398357 571239 23254 CHILDRENS CUTLERY DOLLY GIRL 4 2011-10-14 14:54:00 4.15 12744.0 Singapore 16.60 55.0
398358 571239 84997A CHILDRENS CUTLERY POLKADOT GREEN 4 2011-10-14 14:54:00 4.15 12744.0 Singapore 16.60 55.0
398359 571239 84997C CHILDRENS CUTLERY POLKADOT BLUE 4 2011-10-14 14:54:00 4.15 12744.0 Singapore 16.60 55.0
398360 571239 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 2011-10-14 14:54:00 4.15 12744.0 Singapore 16.60 55.0
398361 571239 84997B CHILDRENS CUTLERY RETROSPOT RED 4 2011-10-14 14:54:00 4.15 12744.0 Singapore 16.60 55.0
398362 571239 84997D CHILDRENS CUTLERY POLKADOT PINK 4 2011-10-14 14:54:00 4.15 12744.0 Singapore 16.60 55.0
398363 571239 23256 CHILDRENS CUTLERY SPACEBOY 4 2011-10-14 14:54:00 4.15 12744.0 Singapore 16.60 55.0
398364 571239 22343 PARTY PIZZA DISH RED RETROSPOT 72 2011-10-14 14:54:00 0.72 12744.0 Singapore 51.84 55.0
398365 571239 84596B SMALL DOLLY MIX DESIGN ORANGE BOWL 16 2011-10-14 14:54:00 0.42 12744.0 Singapore 6.72 55.0
398366 571239 21947 SET OF 6 HEART CHOPSTICKS 12 2011-10-14 14:54:00 1.25 12744.0 Singapore 15.00 55.0
398367 571239 21948 SET OF 6 CAKE CHOPSTICKS 12 2011-10-14 14:54:00 1.25 12744.0 Singapore 15.00 55.0
398368 571239 21949 SET OF 6 STRAWBERRY CHOPSTICKS 12 2011-10-14 14:54:00 1.25 12744.0 Singapore 15.00 55.0
398369 571239 23229 VINTAGE DONKEY TAIL GAME 6 2011-10-14 14:54:00 3.75 12744.0 Singapore 22.50 55.0
398370 571239 21888 BINGO SET 12 2011-10-14 14:54:00 3.75 12744.0 Singapore 45.00 55.0
406404 C571750 M Manual -1 2011-10-19 11:16:00 3949.32 12744.0 Singapore -3949.32 51.0
406405 C571750 M Manual -1 2011-10-19 11:16:00 2118.74 12744.0 Singapore -2118.74 51.0
406406 571751 M Manual 1 2011-10-19 11:18:00 3949.32 12744.0 Singapore 3949.32 51.0
406407 571751 M Manual 1 2011-10-19 11:18:00 2118.74 12744.0 Singapore 2118.74 51.0

229 rows × 10 columns

The table below is showing which country has the highest average Total Line Item Price. The order in this list in similar to the average number of line items table above. It makes sense that the country that buy the most of a single item will also have the most amount of money spent on a single item. We can take note the outlier found above for Singapore may affect their standing in this chart. The outlier was previously noted.

In [162]:
count_by_country.sort_values(by=['Avg_TotalPrice'],ascending=False,inplace=True)
avg_sale = count_by_country['Avg_TotalPrice']

country = count_by_country['Country']
plt.figure(figsize=(16,10))
plt.title('Average Total Price per Line Item by Country')
plt.xlabel('Number of Customers')
#ax.fig.set_size_inches(15,15)
sns.barplot(x=avg_sale,y=country,orient='h')
plt.show()
display(count_by_country[['Avg_TotalPrice', 'Country']])
Avg_TotalPrice Country
23 120.059696 Netherlands
0 108.910787 Australia
19 98.716816 Japan
31 79.360976 Sweden
9 48.247147 Denmark
21 47.458857 Lithuania
29 39.827031 Singapore
20 37.641778 Lebanon
4 35.737500 Brazil
10 33.445054 EIRE
24 32.378877 Norway
15 32.263836 Greece
2 32.258824 Bahrain
12 32.124806 Finland
32 29.696004 Switzerland
17 28.293117 Israel
34 27.974706 United Arab Emirates
6 26.520991 Channel Islands
1 25.322494 Austria
5 24.280662 Canada
16 23.681319 Iceland
8 23.590667 Czech Republic
14 23.365978 Germany
13 23.200714 France
30 21.659822 Spain
11 21.176230 European Community
25 21.152903 Poland
7 21.045434 Cyprus
18 21.034259 Italy
3 19.773301 Belgium
22 19.728110 Malta
26 19.711598 Portugal
35 18.914008 United Kingdom
27 17.281207 RSA
28 13.117000 Saudi Arabia
36 11.040539 Unspecified
33 5.948179 USA

Here we examine each transaction's Total Price by country, looking for outliers. Since Total Price is a combination of Quantity and Unit Price, we will be able to check both of these features with this chart.

In [163]:
marketing_data['TotalPrice'] = marketing_data['UnitPrice'] *  marketing_data['Quantity']
plt.figure(figsize=(16,10))
sns.boxplot(x='TotalPrice', y='Country', data=marketing_data, orient='h')
display (marketing_data.describe())
Quantity UnitPrice CustomerID TotalPrice Recency
count 401604.000000 401604.000000 401604.000000 401604.000000 401604.000000
mean 12.183273 3.474064 15281.160818 20.613638 151.499186
std 250.283037 69.764035 1714.006089 430.352218 112.722740
min -80995.000000 0.000000 12346.000000 -168469.600000 0.000000
25% 2.000000 1.250000 13939.000000 4.250000 50.000000
50% 5.000000 1.950000 15145.000000 11.700000 132.000000
75% 12.000000 3.750000 16784.000000 19.800000 246.000000
max 80995.000000 38970.000000 18287.000000 168469.600000 373.000000

The graph above is not very granular, however we can see some outliers from the United Kingdom. The outlier seems to be symmetric so we need to examine these. The outliers appear to all be above 3500 and below -$3500. Below we look for the transactions that meet this criterion.

In [164]:
min_sales = marketing_data.loc[(marketing_data['TotalPrice'] < -35000)]
max_sales = marketing_data.loc[(marketing_data['TotalPrice'] > 35000)]

display(min_sales)
display(max_sales)
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country TotalPrice Recency
61624 C541433 23166 MEDIUM CERAMIC TOP STORAGE JAR -74215 2011-01-18 10:17:00 1.04 12346.0 United Kingdom -77183.6 325.0
222681 C556445 M Manual -1 2011-06-10 15:31:00 38970.00 15098.0 United Kingdom -38970.0 181.0
540422 C581484 23843 PAPER CRAFT , LITTLE BIRDIE -80995 2011-12-09 09:27:00 2.08 16446.0 United Kingdom -168469.6 0.0
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country TotalPrice Recency
61619 541431 23166 MEDIUM CERAMIC TOP STORAGE JAR 74215 2011-01-18 10:01:00 1.04 12346.0 United Kingdom 77183.6 325.0
222680 556444 22502 PICNIC BASKET WICKER 60 PIECES 60 2011-06-10 15:28:00 649.50 15098.0 United Kingdom 38970.0 181.0
540421 581483 23843 PAPER CRAFT , LITTLE BIRDIE 80995 2011-12-09 09:15:00 2.08 16446.0 United Kingdom 168469.6 0.0

The Invoice numbers for all the high value negative transactions have a 'C' at the beginning. We can speculate these are cancelled orders. The Stock codes for two of the three matching transactions also match. This is further proof these are not mistakes, but actual orders that were cancelled. We will retain them in the data set.

Grouping Values by Invoice Number

Here we want to collect all the relevant information for each unique invoice. Invoices are spread all over the data set since each row represents a single line item in a unique invoice.

This data set groups by InvoiceNo which results in having one row representing one unique invoice, total transaction value of that invoice, customer, country, invoice date, total quantity and descriptions plus all other information related to the invoice.

In [165]:
by_invoice = marketing_data.groupby(['InvoiceNo', 'InvoiceDate','Country','CustomerID', 'Recency'], as_index=False)[ 'UnitPrice','Description','Quantity','TotalPrice'].agg(lambda x: list(x))
by_invoice['TotalPrice'] = by_invoice['TotalPrice'].apply(sum)
by_invoice['QuantityTotal'] = by_invoice['Quantity'].apply(sum)
by_invoice['Description_as_string'] = by_invoice['Description'].apply(lambda x: "-".join(sorted(x)))
In [166]:
by_invoice.head()
Out[166]:
InvoiceNo InvoiceDate Country CustomerID Recency UnitPrice Description Quantity TotalPrice QuantityTotal Description_as_string
0 536365 2010-12-01 08:26:00 United Kingdom 17850.0 373.0 [2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25] [WHITE HANGING HEART T-LIGHT HOLDER, WHITE MET... [6, 6, 8, 6, 6, 2, 6] 139.12 40 CREAM CUPID HEARTS COAT HANGER-GLASS STAR FROS...
1 536366 2010-12-01 08:28:00 United Kingdom 17850.0 373.0 [1.85, 1.85] [HAND WARMER UNION JACK, HAND WARMER RED POLKA... [6, 6] 22.20 12 HAND WARMER RED POLKA DOT-HAND WARMER UNION JACK
2 536367 2010-12-01 08:34:00 United Kingdom 13047.0 373.0 [1.69, 2.1, 2.1, 3.75, 1.65, 4.25, 4.95, 9.95,... [ASSORTED COLOUR BIRD ORNAMENT, POPPY'S PLAYHO... [32, 6, 6, 8, 6, 6, 3, 2, 3, 3, 4, 4] 278.73 83 ASSORTED COLOUR BIRD ORNAMENT-BOX OF 6 ASSORTE...
3 536368 2010-12-01 08:34:00 United Kingdom 13047.0 373.0 [4.25, 4.95, 4.95, 4.95] [JAM MAKING SET WITH JARS, RED COAT RACK PARIS... [6, 3, 3, 3] 70.05 15 BLUE COAT RACK PARIS FASHION-JAM MAKING SET WI...
4 536369 2010-12-01 08:35:00 United Kingdom 13047.0 373.0 [5.95] [BATH BUILDING BLOCK WORD] [3] 17.85 3 BATH BUILDING BLOCK WORD

This chart below is showing us which country spends the most money on average on each invoice. It is not surprising to see the Netherlands leading this chart as this country was at the top of Average Price per Line Item and the Average Number of Items per Line Item lists.

In [167]:
invoice_by_country = by_invoice.groupby(['Country']).count()
invoice_by_country['Avg_TotalPrice'] = by_invoice.groupby(['Country'])['TotalPrice'].mean()

invoice_by_country.reset_index(level=0, inplace=True)
invoice_by_country.sort_values(by=['Avg_TotalPrice'],ascending=False,inplace=True)


avg_invoice = invoice_by_country['Avg_TotalPrice']
country = invoice_by_country['Country']
plt.figure(figsize=(16,10))
plt.title('Average Total Invoice per Country')
plt.xlabel('Number of Customers')
sns.barplot(x=avg_invoice,y=country,orient='h')
plt.show()
display(count_by_country[['Avg_TotalPrice', 'Country']])
Avg_TotalPrice Country
23 120.059696 Netherlands
0 108.910787 Australia
19 98.716816 Japan
31 79.360976 Sweden
9 48.247147 Denmark
21 47.458857 Lithuania
29 39.827031 Singapore
20 37.641778 Lebanon
4 35.737500 Brazil
10 33.445054 EIRE
24 32.378877 Norway
15 32.263836 Greece
2 32.258824 Bahrain
12 32.124806 Finland
32 29.696004 Switzerland
17 28.293117 Israel
34 27.974706 United Arab Emirates
6 26.520991 Channel Islands
1 25.322494 Austria
5 24.280662 Canada
16 23.681319 Iceland
8 23.590667 Czech Republic
14 23.365978 Germany
13 23.200714 France
30 21.659822 Spain
11 21.176230 European Community
25 21.152903 Poland
7 21.045434 Cyprus
18 21.034259 Italy
3 19.773301 Belgium
22 19.728110 Malta
26 19.711598 Portugal
35 18.914008 United Kingdom
27 17.281207 RSA
28 13.117000 Saudi Arabia
36 11.040539 Unspecified
33 5.948179 USA

Grouping Values by Customer ID

Here our research is based on customers, therefore it is useful to group our dataset by Customer ID and create two other new features (Frequency and Monetary) for each customer from there.

In [168]:
by_customer = by_invoice.groupby(['CustomerID'], as_index=False)['InvoiceNo', 'InvoiceDate', 'Recency', 'Description','Quantity','TotalPrice'].agg(lambda x: list(x))

by_customer['Frequency'] = by_customer['InvoiceNo'].apply(lambda x: len(x))
by_customer['Monetary'] = by_customer['TotalPrice'].apply(sum)
by_customer['Recency'] = by_customer['Recency'].apply(min)
by_customer['Quantity'] = by_customer['Quantity'].apply(lambda x: sum(x,[]))
by_customer['QuantityTotal'] = by_customer['Quantity'].apply(sum)
#by_customer['Description_joined'] = by_customer['Description'].apply(lambda x: "-".join(sorted(x[0])))
In [169]:
by_customer.head()
Out[169]:
CustomerID InvoiceNo InvoiceDate Recency Description Quantity TotalPrice Frequency Monetary QuantityTotal
0 12346.0 [541431, C541433] [2011-01-18 10:01:00, 2011-01-18 10:17:00] 325.0 [[MEDIUM CERAMIC TOP STORAGE JAR], [MEDIUM CER... [74215, -74215] [77183.6, -77183.6] 2 0.00 0
1 12347.0 [537626, 542237, 549222, 556201, 562032, 57351... [2010-12-07 14:57:00, 2011-01-26 14:30:00, 201... 1.0 [[BLACK CANDELABRA T-LIGHT HOLDER, AIRLINE BAG... [12, 4, 12, 36, 12, 12, 12, 12, 12, 12, 4, 4, ... [711.79, 475.39, 636.25, 382.52, 584.91, 1294.... 7 4310.00 2458
2 12348.0 [539318, 541998, 548955, 568172] [2010-12-16 19:09:00, 2011-01-25 10:42:00, 201... 74.0 [[72 SWEETHEART FAIRY CAKE CASES, 60 CAKE CASE... [72, 72, 24, 120, 24, 120, 72, 144, 144, 48, 7... [892.8000000000001, 227.43999999999997, 367.0,... 4 1797.24 2341
3 12349.0 [577609] [2011-11-21 09:51:00] 18.0 [[PARISIENNE CURIO CABINET, SWEETHEART WALL TI... [2, 2, 6, 3, 6, 6, 12, 2, 12, 2, 12, 12, 6, 3,... [1757.55] 1 1757.55 631
4 12350.0 [543037] [2011-02-02 16:01:00] 309.0 [[CHOCOLATE THIS WAY METAL SIGN, METAL SIGN NE... [12, 12, 10, 12, 24, 6, 12, 12, 12, 12, 12, 12... [334.40000000000003] 1 334.40 197

Next, we examine the distribution of the Frequency and Recency features we created. The Recency of people purchasing from us is on average just under 100 days. The Frequency, which shows how often people buy, reveals we have quite a few good customers, with at least one customer buying from us 249 times. Further research to identify these high repeat customers would something to do in the future.

In [170]:
desc_rec = by_customer['Recency'].describe()
desc_freq = by_customer['Frequency'].describe()
list_of_desc = [desc_rec, desc_freq]
fig = plt.figure(figsize=(8,12))
sub1 = fig.add_subplot(2,2,1)
sub1.set_xlabel('Recency')
sub1.set_ylabel('Days')
sub1.boxplot(by_customer['Recency'],showmeans=True,)
sub2 = fig.add_subplot(2,2,2)
sub2.set_xlabel('Frequency')
sub2.set_ylabel('Days')
sub2.boxplot(by_customer['Frequency'],showmeans=True)
plt.show()
#sub3 = fig.add_subplot(2,2,3)
#ub3.plot.table(desc_rec)
display (desc_rec)
display (desc_freq)
count    4372.000000
mean       91.047118
std       100.765435
min         0.000000
25%        16.000000
50%        49.000000
75%       142.000000
max       373.000000
Name: Recency, dtype: float64
count    4372.000000
mean        5.082571
std         9.361120
min         1.000000
25%         1.000000
50%         3.000000
75%         5.000000
max       249.000000
Name: Frequency, dtype: float64

Classify Transaction type

In [171]:
by_invoice['Transaction']=''

for index,row in by_invoice.iterrows():
    if str(row['InvoiceNo']).startswith("C"):
        by_invoice.loc[index,'Transaction'] = 'Cancel'
    elif str(row['InvoiceNo']).startswith("A"):
        by_invoice.loc[index,'Transaction'] = 'Adjust'
    else:
        by_invoice.loc[index,'Transaction'] = 'Purchase'
In [172]:
by_invoice.sample(10)
Out[172]:
InvoiceNo InvoiceDate Country CustomerID Recency UnitPrice Description Quantity TotalPrice QuantityTotal Description_as_string Transaction
11013 563921 2011-08-21 14:00:00 United Kingdom 16407.0 109.0 [7.95, 7.95] [DOORMAT NEW ENGLAND, DOORMAT FAIRY CAKE] [1, 1] 15.90 2 DOORMAT FAIRY CAKE-DOORMAT NEW ENGLAND Purchase
14772 572886 2011-10-26 13:46:00 Spain 12448.0 43.0 [1.65, 1.25, 65.0, 0.19, 0.39, 1.25, 0.55, 8.2... [PLASTERS IN TIN VINTAGE PAISLEY , PACK OF 6 B... [12, 12, 1, 24, 12, 24, 24, 2, 24, 12, 12, 12,... 449.45 243 20 DOLLY PEGS RETROSPOT-ASS FLORAL PRINT MULTI... Purchase
12014 566308 2011-09-12 09:38:00 United Kingdom 15865.0 88.0 [4.95, 12.75, 12.75, 8.5, 8.5, 7.95, 8.5, 2.89... [SET OF 3 REGENCY CAKE TINS, REGENCY CAKESTAND... [8, 2, 2, 2, 2, 2, 2, 6, 12, 12, 12, 12, 12, 2... 289.92 134 BUNDLE OF 3 ALPHABET EXERCISE BOOKS-BUNDLE OF ... Purchase
19 536386 2010-12-01 09:57:00 United Kingdom 16029.0 373.0 [4.95, 1.65, 1.65] [WHITE WIRE EGG HOLDER, JUMBO BAG BAROQUE BLA... [36, 100, 100] 508.20 236 JUMBO BAG BAROQUE BLACK WHITE-JUMBO BAG RED R... Purchase
11132 564172 2011-08-23 14:19:00 United Kingdom 16033.0 107.0 [7.95, 7.95, 0.42, 4.25, 5.95, 5.95, 2.95, 4.9... [DOORMAT WELCOME SUNRISE, DOORMAT UNION FLAG, ... [3, 2, 25, 4, 2, 2, 2, 4, 12, 4, 6, 24, 8, 4, ... 630.30 276 20 DOLLY PEGS RETROSPOT-3 TRADITIONAl BISCUIT ... Purchase
8059 556725 2011-06-14 11:28:00 United Kingdom 18102.0 178.0 [12.48] [SET 3 WICKER OVAL BASKETS W LIDS] [200] 2496.00 200 SET 3 WICKER OVAL BASKETS W LIDS Purchase
11999 566293 2011-09-11 15:35:00 United Kingdom 14533.0 88.0 [0.29, 3.39, 1.69, 0.85, 2.08, 1.65, 0.79, 0.0... [12 PENCILS TALL TUBE POSY, SET 6 SCHOOL MILK ... [96, 24, 12, 12, 12, 24, 20, 96, 100, 24, 25, ... 408.38 555 12 PENCILS TALL TUBE POSY-6 GIFT TAGS 50'S CHR... Purchase
13198 569203 2011-10-02 10:32:00 United Kingdom 16353.0 68.0 [4.95, 1.25] [CHILLI LIGHTS, RED RETROSPOT OVEN GLOVE ] [48, 20] 262.60 68 CHILLI LIGHTS-RED RETROSPOT OVEN GLOVE Purchase
3548 545587 2011-03-04 09:46:00 United Kingdom 14796.0 280.0 [1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.25, 1.2... [POTTING SHED TEA MUG, IF YOU CAN'T STAND THE ... [2, 1, 1, 2, 2, 2, 1, 3, 2, 12, 12, 12, 12, 12... 281.49 221 60 TEATIME FAIRY CAKE CASES-AREA PATROLLED MET... Purchase
8182 557023 2011-06-16 13:04:00 United Kingdom 16225.0 175.0 [2.95, 3.75, 4.95, 4.95, 3.75, 2.95, 5.95, 3.7... [PINK OWL SOFT TOY, MOUSE TOY WITH PINK T-SHIR... [6, 4, 9, 6, 4, 6, 2, 4, 12] 183.95 53 BAKING SET 9 PIECE RETROSPOT -BAKING SET SPACE... Purchase

Now create a word cloud from the Descriptions feature of the data set. This will highlight the most common words found in each line item of descriptions of the order. Alarm, Clock, Bakelike, Red and Children are some of the most common words in the description field.

In [173]:
# Code from stackoverflow: https://stackoverflow.com/questions/16645799/how-to-create-a-word-cloud-from-a-corpus-in-python
stopwords = set(STOPWORDS)
def show_wordcloud(data, title = None):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=200,
        max_font_size=40, 
        scale=3,
        random_state=1 # chosen at random by flipping a coin; it was heads
    ).generate(str(data))

    fig = plt.figure(1, figsize=(16, 12))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()
    
show_wordcloud(marketing_data['Description'])

Modeling and Evaluation

Cluster Analysis --K Means on RFM Customer Behavior

What is an RFM models and why use it?

RFM stands for Recency, Frequency, & Monetary Model which is a classic business tool that’s benefited the marketing and sales departments of high customer volume companies. The basic question proposed in these departments is: Who is your “best” customer. And here lies the problem of determining which customers qualify as “best” since it’s such a subjective word. Some businesses value loyalty over high sales since they believe consistency and predictability is best for them. Others believe only in high volume sales even if it was seasonal, and so on.

The RFM Model comes from a variation of previous models that attempt to segment the business’s customer base or potential customer base quantitatively using the three variables scoring methodology. This might seem intuitive, but as we discuss in the Deployment Section, it’s not often implemented as we’d think.

Further examination from the predictive analytics software company, Canopy Labs excerpt:

1) Recency: When was the last time a customer made a purchase order with your business? According to the RFM model, a customer who has recently interacted with your store is more inclined to accept another interaction that you initiate.

2) Frequency: How regularly does this customer make a purchase with your business? Over time, the frequency of a purchase can, in most cases, predict the likelihood and schedule of future purchases.

3) Monetary Value: How much money does this customer spend over a period of time? Depending on what makes more sense for your company, you might decide to calculate monetary value as revenue or profitability. Either way, this input adds monetary spend into the equation to calculate your RFM.

Source: https://canopylabs.com/resources/an-introduction-to-the-rfm-model/

Coding Source: https://towardsdatascience.com/find-your-best-customers-with-customer-segmentation-in-python-61d602f9eee6

In [174]:
#Dataset used with new appended fields
#Dataset accounts for cancelled orders. 
by_customer.head() 
Out[174]:
CustomerID InvoiceNo InvoiceDate Recency Description Quantity TotalPrice Frequency Monetary QuantityTotal
0 12346.0 [541431, C541433] [2011-01-18 10:01:00, 2011-01-18 10:17:00] 325.0 [[MEDIUM CERAMIC TOP STORAGE JAR], [MEDIUM CER... [74215, -74215] [77183.6, -77183.6] 2 0.00 0
1 12347.0 [537626, 542237, 549222, 556201, 562032, 57351... [2010-12-07 14:57:00, 2011-01-26 14:30:00, 201... 1.0 [[BLACK CANDELABRA T-LIGHT HOLDER, AIRLINE BAG... [12, 4, 12, 36, 12, 12, 12, 12, 12, 12, 4, 4, ... [711.79, 475.39, 636.25, 382.52, 584.91, 1294.... 7 4310.00 2458
2 12348.0 [539318, 541998, 548955, 568172] [2010-12-16 19:09:00, 2011-01-25 10:42:00, 201... 74.0 [[72 SWEETHEART FAIRY CAKE CASES, 60 CAKE CASE... [72, 72, 24, 120, 24, 120, 72, 144, 144, 48, 7... [892.8000000000001, 227.43999999999997, 367.0,... 4 1797.24 2341
3 12349.0 [577609] [2011-11-21 09:51:00] 18.0 [[PARISIENNE CURIO CABINET, SWEETHEART WALL TI... [2, 2, 6, 3, 6, 6, 12, 2, 12, 2, 12, 12, 6, 3,... [1757.55] 1 1757.55 631
4 12350.0 [543037] [2011-02-02 16:01:00] 309.0 [[CHOCOLATE THIS WAY METAL SIGN, METAL SIGN NE... [12, 12, 10, 12, 24, 6, 12, 12, 12, 12, 12, 12... [334.40000000000003] 1 334.40 197
In [175]:
# Duplicated the dataframe to avoid accidental manipulation. 
maa = by_customer.copy()

#Isolated the variables that are relevant to the analysis
mad = maa[['CustomerID', 'Recency', 'Frequency', 'Monetary']]
mad.set_index('CustomerID',inplace=True)

mad.head()



#mad.to_excel('C:/Users/Bahr/Desktop/dakas.xlsx')  #Testing dataset accuracy 

#maa.to_excel('C:/Users/Bahr/Desktop/dakasss.xlsx')  #Testing dataset accuracy 

#The following table is an RFM table 
Out[175]:
Recency Frequency Monetary
CustomerID
12346.0 325.0 2 0.00
12347.0 1.0 7 4310.00
12348.0 74.0 4 1797.24
12349.0 18.0 1 1757.55
12350.0 309.0 1 334.40
In [176]:
#Create customer segments from RFM model using Quartiles 
#Here we want to create an RFM score for each customer to easily identify the customer's behavior. 
#The score is from 1-4 but can be 1-5, etc. 

quant = mad.quantile(q=[0.25,0.5,0.75])
quant
Out[176]:
Recency Frequency Monetary
0.25 16.0 1.0 291.795
0.50 49.0 3.0 644.070
0.75 142.0 5.0 1608.335
In [177]:
quant.to_dict()
Out[177]:
{'Recency': {0.25: 16.0, 0.5: 49.0, 0.75: 142.0},
 'Frequency': {0.25: 1.0, 0.5: 3.0, 0.75: 5.0},
 'Monetary': {0.25: 291.79499999999996,
  0.5: 644.0699999999999,
  0.75: 1608.335}}
In [178]:
#Here we are creating the RFM segmentation rankings. 
#The score is from 1-4 but can be 1-5, etc. 
#A rank of 4 will be the highest/best. A rank of 1 will be the worst/lowest. 
#We have two arguments because in case of recency, the lower the number (days) the better.



#Arguments (x= value, p = recency, monetary_value, frequency, d = quartiles dict)
def Rscore(x,p,d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]: 
        return 2
    else:
        return 1

#Arguments (x= value, p = recency, monetary_value, frequency, d = quartiles dict)  
def FMscore(x,p,d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]: 
        return 3
    else:
        return 4
In [179]:
#Create RFM segmentation table
segmentation = mad.copy()
segmentation['R_Quartile'] = segmentation['Recency'].apply(Rscore, args=('Recency',quant))
segmentation['F_Quartile'] = segmentation['Frequency'].apply(FMscore, args=('Frequency',quant))
segmentation['M_Quartile'] = segmentation['Monetary'].apply(FMscore, args=('Monetary',quant))
print(segmentation[segmentation['Frequency'].isna()])
Empty DataFrame
Columns: [Recency, Frequency, Monetary, R_Quartile, F_Quartile, M_Quartile]
Index: []

Now each customer is assigned in a quartile.

In [180]:
segmentation.head()
Out[180]:
Recency Frequency Monetary R_Quartile F_Quartile M_Quartile
CustomerID
12346.0 325.0 2 0.00 1 2 1
12347.0 1.0 7 4310.00 4 4 4
12348.0 74.0 4 1797.24 2 3 4
12349.0 18.0 1 1757.55 3 1 4
12350.0 309.0 1 334.40 1 1 2
In [181]:
#adding RFM score for each customer

segmentation['RFM_Score'] = segmentation.R_Quartile.map(str) \
                            + segmentation.F_Quartile.map(str) \
                            + segmentation.M_Quartile.map(str)
segmentation.head()

print(segmentation[segmentation['Frequency'].isna()])
Empty DataFrame
Columns: [Recency, Frequency, Monetary, R_Quartile, F_Quartile, M_Quartile, RFM_Score]
Index: []
In [182]:
#Filtering for the "best" Customers.

segmentation[segmentation['RFM_Score']=='444'].head(10)
Out[182]:
Recency Frequency Monetary R_Quartile F_Quartile M_Quartile RFM_Score
CustomerID
12347.0 1.0 7 4310.00 4 4 4 444
12359.0 7.0 6 6182.98 4 4 4 444
12362.0 2.0 13 5154.58 4 4 4 444
12381.0 4.0 6 1803.96 4 4 4 444
12388.0 15.0 6 2780.66 4 4 4 444
12395.0 15.0 15 2998.28 4 4 4 444
12417.0 2.0 12 3578.80 4 4 4 444
12423.0 0.0 9 1849.11 4 4 4 444
12433.0 0.0 7 13375.87 4 4 4 444
12437.0 1.0 19 4896.66 4 4 4 444
In [183]:
#Here we will assign the scores the appropriate segments and quantify customers that fall in those groups. 

display(Image('https://i.imgur.com/vEUDYTc.png', width=700, unconfined=True))

#Source: https://www.blastam.com/blog/rfm-analysis-boosts-sales

Earlier we suggested that a score of 4 was the best in each feature. This is because some analysts add the score from the three features and rank the customers based on the total. Example: a total score of 12 ranks the customers the highest. This simplification is made for the sales field people who don’t want to drill down into the data. In the chart above, the best customers have the lowest total score but, in this case, their purpose wasn’t to add up the score. Their intention is to use a 3-digit representation as a score. We will stick with the "444" as representing the best customer.

In [184]:
#Manual assigning of customer groups trial 1. 

print("Best Customers: ",len(segmentation[segmentation['RFM_Score']=='444']))
print('Loyal Customers: ',len(segmentation[segmentation['F_Quartile']==4]))
print("Big Spenders: ",len(segmentation[segmentation['M_Quartile']==4]))
print('Almost Lost: ', len(segmentation[segmentation['RFM_Score']=='244']))
print('Lost Customers: ',len(segmentation[segmentation['RFM_Score']=='144']))
print('Lost Cheap Customers: ',len(segmentation[segmentation['RFM_Score']=='111']))
Best Customers:  496
Loyal Customers:  1089
Big Spenders:  1093
Almost Lost:  91
Lost Customers:  15
Lost Cheap Customers:  404
In [185]:
#Manual assigning of customer groups trial 2.

print("Best Customers: ",len(segmentation[segmentation['RFM_Score']=='444']))
print('Growth Customers: ', len(segmentation[segmentation['RFM_Score']=='333']))
print('Tipping point Customers: ',len(segmentation[segmentation['RFM_Score']=='222']))
print('Lost Customers: ',len(segmentation[segmentation['RFM_Score']=='111']))
Best Customers:  496
Growth Customers:  120
Tipping point Customers:  146
Lost Customers:  404
In [186]:
#Manual assigning of customer groups trial 3. 

upper_list = ['444','433','434','333','344','343','443','334']
lower_list = ['222','211','212','111','122','121','112','221']


#len([k for k in temp if segmentation['RFM_Score'] in k])

print("Valued Customers: ",len( segmentation[segmentation['RFM_Score'].isin(upper_list) ])) 
print('Normal Customers: ',len(segmentation[segmentation['RFM_Score'].isin(lower_list)])) 
Valued Customers:  1227
Normal Customers:  1383
In [187]:
#Using K-Means Clustering on RFM Variables
#Create new dataframe to avoid complications. 

rfm_data = segmentation.drop(['R_Quartile','F_Quartile','M_Quartile','RFM_Score'],axis=1)
rfm_data.head()
print(rfm_data[rfm_data['Frequency'].isna()])
Empty DataFrame
Columns: [Recency, Frequency, Monetary]
Index: []
In [188]:
rfm_data.corr() #Feature Correlations
Out[188]:
Recency Frequency Monetary
Recency 1.000000 -0.259103 -0.131704
Frequency -0.259103 1.000000 0.565034
Monetary -0.131704 0.565034 1.000000
In [189]:
#From the heatmap and from the above feature correlation numbers we see that Frequency and Monetary are correlated best. 
#We will explore this relationship. 

sns.heatmap(rfm_data.corr(),cmap="Greens");
In [190]:
#Visualizing Feature Distributions
#From the visualization, we see that the variables are skewed thus we need to normalize.  


scatter_matrix(mad, alpha = 0.3, figsize = (15,8), diagonal = 'kde')
scatter_matrix(rfm_data, alpha = 0.3, figsize = (15,8), diagonal = 'kde');
In [191]:
#Data normalization using log. 
#Here we had a slight problem with NAN values. We replaced NAN with Mean. 

rfm_r_log = np.log(rfm_data['Recency']+.01) 
rfm_f_log = np.log(rfm_data['Frequency']+.01)
rfm_m_log = np.log(rfm_data['Monetary']+.01)
C:\users\mtool\Anaconda3\lib\site-packages\ipykernel_launcher.py:6: RuntimeWarning:

invalid value encountered in log

In [192]:
log_data = pd.DataFrame({'Monetary': rfm_m_log, 'Recency': rfm_r_log, 'Frequency': rfm_f_log})
log_data.head()
Out[192]:
Monetary Recency Frequency
CustomerID
12346.0 -4.605170 5.783856 0.698135
12347.0 8.368696 0.009950 1.947338
12348.0 7.494013 4.304200 1.388791
12349.0 7.471682 2.890927 0.009950
12350.0 5.812368 5.733374 0.009950
In [193]:
#Created another data frame to test NAN and infinity issues. 

test = log_data

test = test.fillna(test.mean())

test.head()
Out[193]:
Monetary Recency Frequency
CustomerID
12346.0 -4.605170 5.783856 0.698135
12347.0 8.368696 0.009950 1.947338
12348.0 7.494013 4.304200 1.388791
12349.0 7.471682 2.890927 0.009950
12350.0 5.812368 5.733374 0.009950
In [194]:
np.any(np.isnan(test))
Out[194]:
False
In [195]:
#Visualized result after normalization. Results look much less skewed. 

scatter_matrix(test, alpha = 0.2, figsize = (14,7), diagonal = 'kde');
In [196]:
test.corr()
Out[196]:
Monetary Recency Frequency
Monetary 1.000000 -0.411606 0.730376
Recency -0.411606 1.000000 -0.531992
Frequency 0.730376 -0.531992 1.000000
In [197]:
sns.heatmap(test.corr(),cmap="Greens");

Normalization have made the Monetary and Frequency even more correlated.

K-Means Model

In [198]:
#K-Means Implementation    ---best score is 2 clusters 

matrix = test.values
for n_clusters in range(2,10):
    kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=100)
    kmeans.fit(matrix)
    clusters = kmeans.predict(matrix)
    silhouette_avg = silhouette_score(matrix, clusters)
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)
For n_clusters = 2 The average silhouette_score is : 0.42287524043877595
For n_clusters = 3 The average silhouette_score is : 0.41611532436708487
For n_clusters = 4 The average silhouette_score is : 0.32070110830229115
For n_clusters = 5 The average silhouette_score is : 0.3386276470028391
For n_clusters = 6 The average silhouette_score is : 0.35074277337879867
For n_clusters = 7 The average silhouette_score is : 0.3333836739871028
For n_clusters = 8 The average silhouette_score is : 0.3203379332706616
For n_clusters = 9 The average silhouette_score is : 0.31635626792677446
In [199]:
n_clusters = 2 # we will use this initially to visualize low value (F) and high value customers
kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=30)
kmeans.fit(matrix)
clusters_customers = kmeans.predict(matrix)
silhouette_avg = silhouette_score(matrix, clusters_customers)
print('silhouette Score: {:<.3f}'.format(silhouette_avg))
silhouette Score: 0.423
In [200]:
#create a scatter plot identifying high value and low value/average customers. 
plt.scatter(matrix[:, 0], matrix[:, 2], c=clusters_customers, s=50, cmap='plasma')
#select cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 2], c='black', s=200, alpha=0.5)
plt.xlabel('Monetary', fontsize=15)
plt.ylabel('Frequency', fontsize=15)
plt.show()
In [201]:
# What's the number of customers in each cluster?
pd.DataFrame(pd.Series(clusters_customers).value_counts(), columns = ['Number Customers']).T
Out[201]:
0 1
Number Customers 2910 1462

Overall the clustering of these two customer segments seems close to expected. Of course we'll see overlap due to the dynamic behavioral nature of customers.

Next we will plot 6 customer groups to see what it looks like.

In [202]:
n_clusters = 6  # 6 Clusters for potentially the 6 customer groups we've assigned earlier. 
kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=30)
kmeans.fit(matrix)
clusters_customers = kmeans.predict(matrix)
silhouette_avg = silhouette_score(matrix, clusters_customers)
print('silhouette score: {:<.3f}'.format(silhouette_avg))
silhouette score: 0.351
In [203]:
#create a scatter plot identifying customers in increasing value. 
plt.scatter(matrix[:, 0], matrix[:, 2], c=clusters_customers, s=50, cmap='plasma')
#select cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 2], c='black', s=200, alpha=0.5)
plt.xlabel('Monetary', fontsize=15)
plt.ylabel('Frequency', fontsize=15)
plt.show()
In [204]:
# What's the number of customers in each cluster?
pd.DataFrame(pd.Series(clusters_customers).value_counts(), columns = ['Number Customers']).T
Out[204]:
3 0 2 1 5 4
Number Customers 1512 1180 825 736 105 14
In [205]:
n_clusters = 4  # we will use 
kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=30)
kmeans.fit(matrix)
clusters_customers = kmeans.predict(matrix)
silhouette_avg = silhouette_score(matrix, clusters_customers)
print('silhouette Score: {:<.3f}'.format(silhouette_avg))
silhouette Score: 0.321
In [206]:
#create a scatter plot identifying customers in increasing value. 
plt.scatter(matrix[:, 0], matrix[:, 2], c=clusters_customers, s=50, cmap='plasma')
#select cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 2], c='black', s=200, alpha=0.5)
plt.xlabel('Monetary', fontsize=15)
plt.ylabel('Frequency', fontsize=15)
plt.show()
In [207]:
# What's the number of customers in each cluster?
pd.DataFrame(pd.Series(clusters_customers).value_counts(), columns = ['Number Customers']).T
Out[207]:
3 1 0 2
Number Customers 1737 1654 876 105

3D K-means Clustering

In [208]:
import plotly
import plotly.graph_objs as go
def getTrace(x, y, z, c, label, s=2):
    trace_points = go.Scatter3d(
    x=x, y=y, z=z,
    mode='markers',
    marker=dict(size=s, line=dict(color='rgb(0, 0, 0)', width=0.5), color=c, opacity=1),
    name=label)
    return trace_points;
In [209]:
def showGraph(title, x_colname, x_range, y_colname, y_range, z_colname, z_range, traces):
    layout = go.Layout(
    title=title,
    scene = dict(
    xaxis=dict(title=x_colname, range = x_range),
    yaxis=dict(title=y_colname, range = y_range),
    zaxis=dict(title=z_colname, range = z_range)
    )
    )
    
    figgg = go.Figure(data=traces, layout=layout)
    plotly.offline.plot(figgg)
In [210]:
#Coding Source: http://www.semspirit.com/artificial-intelligence/machine-learning/clustering/k-means-clustering/k-means-clustering-in-python/


from sklearn.cluster import KMeans
n_opt_clusters = 2
kmeans = KMeans( init = 'k-means++', n_clusters = n_opt_clusters, max_iter=1000, n_init = 30, random_state=0)
kmeans.fit(matrix)
y_kmeans = kmeans.predict(matrix)
In [211]:
#visualising the clusters
centroids = getTrace(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], kmeans.cluster_centers_[:, 2], s= 25, c = 'yellow', label='Centroids')
t1 = getTrace(matrix[y_kmeans == 0, 0], matrix[y_kmeans == 0, 1], matrix[y_kmeans == 0, 2], s= 5, c='red', label = '0: Valued Customers')
t2 = getTrace(matrix[y_kmeans == 1, 0], matrix[y_kmeans == 1, 1], matrix[y_kmeans == 1, 2], s= 5, c='pink', label = '1: Average Customers') 
t3 = getTrace(matrix[y_kmeans == 2, 0], matrix[y_kmeans == 2, 1], matrix[y_kmeans == 2, 2], s= 5, c='blue', label = '2: ') 
t4 = getTrace(matrix[y_kmeans == 3, 0], matrix[y_kmeans == 3, 1], matrix[y_kmeans == 3, 2], s= 5, c='green', label = '3: ') 
t5 = getTrace(matrix[y_kmeans == 4, 0], matrix[y_kmeans == 4, 1], matrix[y_kmeans == 4, 2], s= 5, c='cyan', label = '4: ') 



x=matrix[:,0]
y=matrix[:,1]
z=matrix[:,2]
showGraph("Two Customer Segmentations", "Monetary", [min(x),max(x)], "Frequency", [min(y),max(y)], "Recency", [min(z)-1,max(z)], [t1,t2,t3,t4, t5, centroids])

Since the output is an HTML file, we have screenshot the plots. We will visualize segmentation based on 2, 3, and 5 centroids (previous 2ds plot showed 6 centriod/segments but we think that talking about 15 customers is not worth the squeeze.

In [212]:
display(Image('https://i.imgur.com/PORUzFF.jpg', width=950, unconfined=True))
In [213]:
# What's the number of customers in each cluster?
pd.DataFrame(pd.Series(y_kmeans).value_counts(), columns = ['Number Customers']).T
Out[213]:
1 0
Number Customers 2910 1462

Ramifications: Having a subject matter expert in the group is definitely an advantage in analyzing these clusters. We started out with 2d plots of K-Means but were encouraged to get a more three-dimensional looks at the clustering output due to a suspicion that it will greatly increase insight. And it did. Were we produced two clusters labeled Valued Customer and Average Customer. The reason we chose two clusters is because the Silhouette Score as the highest. However, as we see, there is a group on the right that visually doesn’t look like it belongs.

Regardless, the numbers generated above identifying how many customers are in the first group and in the second group is not exactly in line with our manual RFM assignment in Trial 3. There might be variability in that comparison since Trial 3 had a hard cut off between the Valuable customers and the Average customers. Hence, the manual assignment of customers in Trial 3 omitted the spectrum of customers in between which is strategic in identifying the strongest customers, but only that. We however feel that the clustering amount here does not provide as much insight as we’d like. Thus, we will increase the clusters moving forward.

In [214]:
n_opt_clusters = 3
kmeans = KMeans( init = 'k-means++', n_clusters = n_opt_clusters, max_iter=1000, n_init = 30, random_state=0)
kmeans.fit(matrix)
y_kmeans = kmeans.predict(matrix)
In [215]:
#visualising the clusters
centroids = getTrace(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], kmeans.cluster_centers_[:, 2], s= 25, c = 'yellow', label='Centroids')
t1 = getTrace(matrix[y_kmeans == 0, 0], matrix[y_kmeans == 0, 1], matrix[y_kmeans == 0, 2], s= 5, c='red', label = '0: Valued Customers')
t2 = getTrace(matrix[y_kmeans == 1, 0], matrix[y_kmeans == 1, 1], matrix[y_kmeans == 1, 2], s= 5, c='pink', label = '1: Average Customers') 
t3 = getTrace(matrix[y_kmeans == 2, 0], matrix[y_kmeans == 2, 1], matrix[y_kmeans == 2, 2], s= 5, c='blue', label = '2: Irregular Customers') 
t4 = getTrace(matrix[y_kmeans == 3, 0], matrix[y_kmeans == 3, 1], matrix[y_kmeans == 3, 2], s= 5, c='green', label = '3: ') 
t5 = getTrace(matrix[y_kmeans == 4, 0], matrix[y_kmeans == 4, 1], matrix[y_kmeans == 4, 2], s= 5, c='cyan', label = '4: ') 



x=matrix[:,0]
y=matrix[:,1]
z=matrix[:,2]
showGraph("Three Customer Segmentations", "Monetary", [min(x),max(x)], "Frequency", [min(y),max(y)], "Recency", [min(z)-1,max(z)], [t1,t2,t3,t4, t5, centroids])
In [216]:
display(Image('https://i.imgur.com/H4E6SfS.jpg', width=950, unconfined=True))
In [217]:
# What's the number of customers in each cluster?
pd.DataFrame(pd.Series(y_kmeans).value_counts(), columns = ['Number Customers']).T
Out[217]:
1 0 2
Number Customers 2506 1761 105

Ramifications: The Silhouette score in the 3-cluster group was .01 off from the 2 cluster group. So we’ll treat this result as theoretically the best clustering output. We finally see a distinction in that third cluster on the side that shouldn’t belong with the two central clusters. We’ve identified that group as the Irregular customers group. That customer group is tricky to pin down since it’s elongated and has everything from high monetary and recency to low monetary and low recency. The frequency remains the same. Which explains why in the 2d plots we will see later, the monetary vs frequency plots look much better than the monetary vs recency plots. That’s because there is so much more overlap since customer behavior is a spectrum.

In [218]:
n_opt_clusters = 5
kmeans = KMeans( init = 'k-means++', n_clusters = n_opt_clusters, max_iter=1000, n_init = 30, random_state=0)
kmeans.fit(matrix)
y_kmeans = kmeans.predict(matrix)
In [219]:
#visualising the clusters
centroids = getTrace(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], kmeans.cluster_centers_[:, 2], s= 25, c = 'yellow', label='Centroids')
t1 = getTrace(matrix[y_kmeans == 0, 0], matrix[y_kmeans == 0, 1], matrix[y_kmeans == 0, 2], s= 5, c='red', label = '0: Growth Customers')
t2 = getTrace(matrix[y_kmeans == 1, 0], matrix[y_kmeans == 1, 1], matrix[y_kmeans == 1, 2], s= 5, c='pink', label = '1: Tipping Point Customers') 
t3 = getTrace(matrix[y_kmeans == 2, 0], matrix[y_kmeans == 2, 1], matrix[y_kmeans == 2, 2], s= 5, c='blue', label = '2: Best Customers') 
t4 = getTrace(matrix[y_kmeans == 3, 0], matrix[y_kmeans == 3, 1], matrix[y_kmeans == 3, 2], s= 5, c='green', label = '3: Irregular Customers') 
t5 = getTrace(matrix[y_kmeans == 4, 0], matrix[y_kmeans == 4, 1], matrix[y_kmeans == 4, 2], s= 5, c='cyan', label = '4: Loyal Customers') 



x=matrix[:,0]
y=matrix[:,1]
z=matrix[:,2]
showGraph("Five Customer Segmentations", "Monetary", [min(x),max(x)], "Frequency", [min(y),max(y)], "Recency", [min(z)-1,max(z)], [t1,t2,t3,t4, t5, centroids])
In [220]:
display(Image('https://i.imgur.com/nJyoWo2.jpg', width=950, unconfined=True))
In [221]:
display(Image('https://i.imgur.com/2OChQPZ.jpg', width=950, unconfined=True))
In [222]:
# What's the number of customers in each cluster?
pd.DataFrame(pd.Series(y_kmeans).value_counts(), columns = ['Number Customers']).T
Out[222]:
0 4 1 2 3
Number Customers 1455 1221 824 767 105

Ramifications: Our subject matter expert agrees that visually this is the best clustering graph yet. We tried 6 and 7 clusters but the insights derived from those reached a plateau. The customer groups we have here are the following:

Clustering output

Red: Growth Customers: 1455

Pink: Tipping Point Customers: 824

Blue: Best Customers: 767

Green: Irregular Customers: 105

Cyan: Loyal Customers: 1221

Looking at the first “Five Customer Segmentation” plot, we can clearly see how the Blue group, the Best Customers, have the highest monetary, recency, and frequency values (frequency can be better seen in the second screenshot). The Cyan group, the Loyal Customers were assigned that category because as we can see, they might not buy the most, or the most recent, but they are the highest overall group in terms of frequency. They are closer to us visually on the z axis. Next, we have the Red group, the Growth customers. They have been labeled that way, because unlike the Loyal customers, they lack in monetary and recency, but just behind them. This is a big opportunity for the sales folks, the middle customer segments that’s often ignored in business.

Next, we look at the second screenshot for the “Five Customer Segmentation.” Here we can see the frequency axis much better. And we immediately notice that the Green group, as mentioned previously are the worst customers in terms of frequency of sales. However, they are clustered in this elongated shape that makes them anywhere from big spenders to small spenders. Rightfully so, they are an Irregular customer group. And according to the 80/20 law, not worth the squeeze.

Looking at our RFM customer segmentation assignments in Trial 1 and 2, we feel these numbers out performed them due to the inclusive spectral nature of clustering.

Hierarchical Clustering

Here, we perform a different type of clustering "Hierarchical Agglomerative Clustering, HAC". The algorithm initially treats each point as a single cluster. Then merges two items at a time into a new cluster. How the pairs merge involves calculating a dissimilarity between each merged pair and the other samples. There are many ways to do this, we used "Complete", "Ward" and "Average" linkage methods. The pairing process continues until all items merge into a single cluster.

Here is the definition of each methods used:

  • Complete linkage: similarity of the furthest pair. One drawback is that outliers can cause merging of close groups later than is optimal.
  • Ward linkage: This method does not directly define a measure of distance between two points or clusters. It is an ANOVA based approach. One-way univariate ANOVAs are done for each variable with groups defined by the clusters at that stage of the process. At each stage, two clusters merge that provide the smallest increase in the combined error sum of squares.
  • Average linkage: The distance is calculated as the average distance of every point between the two clusters.

The distance method which is used in our clustering is Euclidean method.

We used the average silhouette approach for finding the number of clusters. Briefly, it measures the quality of a clustering. That is, it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering. Silhouette scores can range from -1 to 1, with 1 being the best.

Average silhouette method computes the average silhouette of observations for different values of n. The optimal number of clusters n is the one that maximize the average silhouette over a range of possible values for n (Kaufman and Rousseeuw [1990]).

In [223]:
%%time 


matrix = test.values
params = []
for link in ['ward', 'complete', 'average']:
    for n in range(2,20):

        # append on the clustering
        hc = AgglomerativeClustering(n_clusters=n, affinity='euclidean', linkage=link)
        hc.fit(matrix)
        newfeature_hc = hc.labels_ 
        clusters_hc=hc.fit_predict(matrix)
        hc_silhouette = silhouette_score(matrix, newfeature_hc)
       #print(hc.labels_)
        print("C=", n, "Link type=", link , "Avg Silhouette_score is:", hc_silhouette)
C= 2 Link type= ward Avg Silhouette_score is: 0.31646149908521237
C= 3 Link type= ward Avg Silhouette_score is: 0.3504890963195237
C= 4 Link type= ward Avg Silhouette_score is: 0.28862818844036847
C= 5 Link type= ward Avg Silhouette_score is: 0.29947844620494746
C= 6 Link type= ward Avg Silhouette_score is: 0.29714648088026535
C= 7 Link type= ward Avg Silhouette_score is: 0.28851599369199177
C= 8 Link type= ward Avg Silhouette_score is: 0.27323495686383925
C= 9 Link type= ward Avg Silhouette_score is: 0.24242500474845755
C= 10 Link type= ward Avg Silhouette_score is: 0.24543794547008244
C= 11 Link type= ward Avg Silhouette_score is: 0.25317542697463685
C= 12 Link type= ward Avg Silhouette_score is: 0.2492939286601918
C= 13 Link type= ward Avg Silhouette_score is: 0.2542277930074384
C= 14 Link type= ward Avg Silhouette_score is: 0.2508658845081853
C= 15 Link type= ward Avg Silhouette_score is: 0.24507906371091337
C= 16 Link type= ward Avg Silhouette_score is: 0.23928169132568686
C= 17 Link type= ward Avg Silhouette_score is: 0.23908785060998372
C= 18 Link type= ward Avg Silhouette_score is: 0.23946998804992561
C= 19 Link type= ward Avg Silhouette_score is: 0.22853477532079833
C= 2 Link type= complete Avg Silhouette_score is: 0.45002096106462913
C= 3 Link type= complete Avg Silhouette_score is: 0.4574080067888762
C= 4 Link type= complete Avg Silhouette_score is: 0.42415433903449373
C= 5 Link type= complete Avg Silhouette_score is: 0.3851377174247838
C= 6 Link type= complete Avg Silhouette_score is: 0.18414628990770548
C= 7 Link type= complete Avg Silhouette_score is: 0.2399468299337497
C= 8 Link type= complete Avg Silhouette_score is: 0.2399960673866716
C= 9 Link type= complete Avg Silhouette_score is: 0.2324843522685543
C= 10 Link type= complete Avg Silhouette_score is: 0.23193381991112183
C= 11 Link type= complete Avg Silhouette_score is: 0.21220176110464042
C= 12 Link type= complete Avg Silhouette_score is: 0.20429131554643695
C= 13 Link type= complete Avg Silhouette_score is: 0.19410377263894882
C= 14 Link type= complete Avg Silhouette_score is: 0.17774228061872396
C= 15 Link type= complete Avg Silhouette_score is: 0.1768085274440476
C= 16 Link type= complete Avg Silhouette_score is: 0.17928509592490036
C= 17 Link type= complete Avg Silhouette_score is: 0.16173370712863439
C= 18 Link type= complete Avg Silhouette_score is: 0.16045378470246632
C= 19 Link type= complete Avg Silhouette_score is: 0.16564529181859017
C= 2 Link type= average Avg Silhouette_score is: 0.7460590249166221
C= 3 Link type= average Avg Silhouette_score is: 0.6847693930676568
C= 4 Link type= average Avg Silhouette_score is: 0.682257451703548
C= 5 Link type= average Avg Silhouette_score is: 0.5048490541460189
C= 6 Link type= average Avg Silhouette_score is: 0.4015345344114883
C= 7 Link type= average Avg Silhouette_score is: 0.2977449241049238
C= 8 Link type= average Avg Silhouette_score is: 0.29742429605804027
C= 9 Link type= average Avg Silhouette_score is: 0.28222545951488753
C= 10 Link type= average Avg Silhouette_score is: 0.2791044406403914
C= 11 Link type= average Avg Silhouette_score is: 0.3205991728776747
C= 12 Link type= average Avg Silhouette_score is: 0.30402581216130947
C= 13 Link type= average Avg Silhouette_score is: 0.23146527520401974
C= 14 Link type= average Avg Silhouette_score is: 0.23007836914636834
C= 15 Link type= average Avg Silhouette_score is: 0.1686119327155415
C= 16 Link type= average Avg Silhouette_score is: 0.16659065126542683
C= 17 Link type= average Avg Silhouette_score is: 0.16339444838477468
C= 18 Link type= average Avg Silhouette_score is: 0.1273446103540149
C= 19 Link type= average Avg Silhouette_score is: 0.10389160785759696
Wall time: 1min 51s

Here, the maximum Silhouette scores are belong to the average linkage method as below:

  • n=2 with Silhouette score=0.75
  • n=3 with Silhouette score=0.68
  • n=4 with Silhouette score=0.68

Here, the maximum Silhouette scores belong to the average linkage method which is 0.75 for number of clusters = 2. The clustering for n=3 and 4 are fairly close to the optimal score with a silhouette score of .68.

It worth to improve our HAC by looking at the dendrogram graph which helps to find the optimum number of clusters.

For our dendrogram graph, we used the ward linkage method, which results in a more acceptable graph. (we tried it with other different methods and at the end we decided to continue with ward linkage method.)

In [224]:
# Create Dendrogram graph:
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(25,15))
plt.title("Customer Dendograms")  
dend = shc.dendrogram(shc.linkage(matrix, method='ward')) 

By looking at the Dendrogram graph, we can pick a distance where the branches are longer and draw a cut off from there. No cutting is wrong, and there is no definite procedure to do that. Several cutoffs might be equally good from a theoretical point of view. Due to this flexibility with how to cut the dendrogram we will be using visual queues of plotting the data with the different size groups to see which size of clusters provides the most insight.

By considering 60, 80 and 120 as the distance, we will have 4, 3 and 2 clusters. Next, we will inspect the number of clusters visually by scatter plots and will find out the number of clusters.

Here, we visually inspect the number of clusters with scatter plots for average method and number of clusters equal to 2, 3 and 4. We would like to find the optimal number of clusters.

In [225]:
data = matrix

# Scatter plots for n=2 and linkage=average
cls = AgglomerativeClustering(n_clusters=2, linkage='average')
cls.fit(data)
hac_labels_2 = cls.labels_ 
# Scatter plots for n=3 and linkage=average
cls = AgglomerativeClustering(n_clusters=3, linkage='average')
cls.fit(data)
hac_labels_3 = cls.labels_ 
# Scatter plots for n=4 and linkage=average
cls = AgglomerativeClustering(n_clusters=4, linkage='average')
cls.fit(data)
hac_labels_4 = cls.labels_ 

fig = plt.figure(figsize=(12,8))
title = ['Clusters=2','Clusters=3','Clusters=4']

for i,l in enumerate([hac_labels_2,hac_labels_3, hac_labels_4]):
    
    plt.subplot(3,2,2*i+1)
    plt.scatter(matrix[:, 0], matrix[:, 2]+np.random.random(matrix[:, 1].shape)/2, c=l, cmap='plasma', s=10, linewidths=0)
    plt.xlabel('Monetary'), plt.ylabel('Frequency')
    plt.grid()
    plt.title(title[i])

    plt.subplot(3,2,2*i+2)
    plt.scatter(matrix[:, 0], matrix[:, 1]+np.random.random(matrix[:, 1].shape)/2, c=l, cmap='plasma', s=10, linewidths=0)
    plt.xlabel('Monetary'), plt.ylabel('Recency')
    plt.grid()
    plt.title(title[i])

plt.tight_layout()
plt.show()

By reviewing the above graphs, customer segmentation in clusters equal to 3 is the better one, it illustrates the better division. But, still we do not find the average linkage method as our preferred method in this case as we do not see a good customer segmentation in this method. We may have a better segmentation in ward method.

In the next step, we will switch the linkage method to ward and will visualize the number of clusters in the range of 2 and 6. We will see if we can get a better customer segmentation.

In [226]:
# Scatter plots for n=2 and linkage=ward
cls = AgglomerativeClustering(n_clusters=2, linkage='ward')
cls.fit(data)
hac_labels_2 = cls.labels_ 
# Scatter plots for n=3 and linkage=ward
cls = AgglomerativeClustering(n_clusters=3, linkage='ward')
cls.fit(data)
hac_labels_3 = cls.labels_ 
# Scatter plots for n=4 and linkage=ward
cls = AgglomerativeClustering(n_clusters=4, linkage='ward')
cls.fit(data)
hac_labels_4 = cls.labels_ 
# Scatter plots for n=5 and linkage=ward
cls = AgglomerativeClustering(n_clusters=5, linkage='ward')
cls.fit(data)
hac_labels_5 = cls.labels_ 
# Scatter plots for n=6 and linkage=ward
cls = AgglomerativeClustering(n_clusters=6, linkage='ward')
cls.fit(data)
hac_labels_6 = cls.labels_ 

fig = plt.figure(figsize=(12,20))
title = ['Clusters=2','Clusters=3','Clusters=4', 'Clusters=5','Clusters=6']

for i,l in enumerate([hac_labels_2,hac_labels_3, hac_labels_4, hac_labels_5, hac_labels_6]):
    
    plt.subplot(6,2,2*i+1)
    plt.scatter(matrix[:, 0], matrix[:, 2]+np.random.random(matrix[:, 1].shape)/2, c=l, cmap='plasma', s=20, linewidths=0)
    plt.xlabel('Monetary'), plt.ylabel('Frequency')
    plt.grid()
    plt.title(title[i])

    plt.subplot(6,2,2*i+2)
    plt.scatter(matrix[:, 0], matrix[:, 1]+np.random.random(matrix[:, 1].shape)/2, c=l, cmap='plasma', s=20, linewidths=0)
    plt.xlabel('Monetary'), plt.ylabel('Recency')
    plt.grid()
    plt.title(title[i])

plt.tight_layout()
plt.show()

We can see good segmentation with 3 clusters and see superior segmentation using the ward method. The outliers on the monetary axis are not considered until we start to get 6 clusters, which did not appear to be an effective number of clusters based on our previous analysis with silhouette scores.

DBScan

In [227]:
colors = ['purple','orange','lightblue','red','gray','green','fuchsia','blue','violet','gold',
         'firebrick','olive','lime','navy','lightpink']

lbls = ['Valued Customers','Average Customers','Irregular Customers','forth','fifth','sixth','seventh']
In [228]:
db = DBSCAN(eps=0.33, min_samples=7).fit(data)
In [229]:
# core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
# core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_


print("Percent of outliers: ",str(round((len(labels[db.labels_==-1])/len(labels))*100,2)),"%")

n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

print('Estimated number of clusters: %d' % n_clusters_)



fig, axarr = plt.subplots(2, 1,figsize = (16,10))
plt.suptitle('DBScan plot of best estimate', size = 20)
for ind, x in enumerate(set(labels)):
    if x == -1:
        axarr[0].scatter(data[labels==-1,0],data[labels==-1, 2],c='black', cmap=plt.cm.rainbow, s=5, linewidths=0, label='outlier')
    else:
        axarr[0].scatter(x=data[labels==x,0],y=data[labels==x, 2],c=colors[ind], cmap=plt.cm.rainbow, s=25, linewidths=0)

# plt.legend()
# plt.show()

for ind, x in enumerate(set(labels)):
    if x == -1:
        axarr[1].scatter(data[labels==-1,0],data[labels==-1, 1],c='black', cmap=plt.cm.rainbow, s=5, linewidths=0, label='outlier')
    else:
        axarr[1].scatter(data[labels==x,0],data[labels==x, 1],c=colors[ind], cmap=plt.cm.rainbow, s=25, linewidths=0)
axarr.flat[0].set(xlabel='Monetary', ylabel='Frequency')
axarr.flat[1].set(xlabel='Monetary', ylabel='Recency')

plt.show()
Percent of outliers:  9.9 %
Estimated number of clusters: 10

Here we do an initial clustering with DBScan. We find that the initial model has very few outliers, but the number of clusters aren't aligning with previous observations. We will run a silhouette test to see the configuration that gives us the best score.

In [230]:
best_score = -1
best_eps = 0
best_samples = 0


for eps_n in np.linspace(.25,.75,10):
    for samples_n in np.linspace(15,50,20):
        db = DBSCAN(eps=eps_n, min_samples=samples_n).fit(data)
#         db = DBSCAN(eps=eps_n, min_samples=samples_n).fit_predict(data)
#         db.fit(matrix)
#         db.
        labels = db.labels_        
        n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
        #print(n_clusters_, eps_n, samples_n)
        if n_clusters_ > 2:
            silhouette_avg = round(silhouette_score(matrix, labels),2)
            if(silhouette_avg>best_score):
                best_score = silhouette_avg
                best_eps = eps_n
                best_samples = samples_n
                print("For n_clusters =", n_clusters_,  "The average silhouette_score is :", silhouette_avg)
                print("For eps =", round(eps_n,3),  "min_samples :", round(samples_n,3))
For n_clusters = 10 The average silhouette_score is : -0.28
For eps = 0.25 min_samples : 15.0
For n_clusters = 4 The average silhouette_score is : -0.24
For eps = 0.25 min_samples : 26.053
For n_clusters = 4 The average silhouette_score is : -0.12
For eps = 0.25 min_samples : 29.737
For n_clusters = 3 The average silhouette_score is : -0.07
For eps = 0.25 min_samples : 37.105
For n_clusters = 4 The average silhouette_score is : -0.02
For eps = 0.306 min_samples : 15.0
For n_clusters = 3 The average silhouette_score is : 0.07
For eps = 0.361 min_samples : 20.526
For n_clusters = 4 The average silhouette_score is : 0.19
For eps = 0.417 min_samples : 15.0
For n_clusters = 3 The average silhouette_score is : 0.21
For eps = 0.472 min_samples : 16.842
For n_clusters = 3 The average silhouette_score is : 0.22
For eps = 0.639 min_samples : 26.053
In [231]:
db = DBSCAN(eps=best_eps, min_samples=best_samples).fit(data)

labels = db.labels_


print("Percent of outliers: ",str(round((len(labels[db.labels_==-1])/len(labels))*100,2)),"%")

n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

print('Estimated number of clusters: %d' % n_clusters_)



fig, axarr = plt.subplots(2, 1,figsize = (16,10))
plt.suptitle('DBScan plot of best estimate', size = 20)
for ind, x in enumerate(set(labels)):
    if x == -1:
        axarr[0].scatter(data[labels==-1,0],data[labels==-1, 2],c='black', cmap=plt.cm.rainbow, s=5, linewidths=0, label='outlier')
    else:
        axarr[0].scatter(x=data[labels==x,0],y=data[labels==x, 2],c=colors[ind], cmap=plt.cm.rainbow, s=25, linewidths=0, label=lbls[ind])

# plt.legend()
# plt.show()

for ind, x in enumerate(set(labels)):
    if x == -1:
        axarr[1].scatter(data[labels==-1,0],data[labels==-1, 1],c='black', cmap=plt.cm.rainbow, s=5, linewidths=0, label='outlier')
    else:
        axarr[1].scatter(data[labels==x,0],data[labels==x, 1],c=colors[ind], cmap=plt.cm.rainbow, s=25, linewidths=0, label=lbls[ind])
axarr.flat[0].set(xlabel='Monetary', ylabel='Frequency')
axarr.flat[1].set(xlabel='Monetary', ylabel='Recency')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

plt.show()
Percent of outliers:  6.7 %
Estimated number of clusters: 3

We find that after running the silhouette test that the best score gives us 3 clusters and reduces the percentage of outliers to just under 7%. This seems like a better fit, especially given that this is the optimal silhouetter score.

In [232]:
y_kmeans = db.labels_


#visualising the clusters
t1 = getTrace(matrix[y_kmeans == -1, 0], matrix[y_kmeans == -1, 1], matrix[y_kmeans == -1, 2], s= 2, c='black', label = '-1: Outliers')
t2 = getTrace(matrix[y_kmeans == 0, 0], matrix[y_kmeans == 0, 1], matrix[y_kmeans == 0, 2], s= 6, c='purple', label = '0: Valued Customers')
t3 = getTrace(matrix[y_kmeans == 1, 0], matrix[y_kmeans == 1, 1], matrix[y_kmeans == 1, 2], s= 6, c='orange', label = '1: Average Customers') 
t4 = getTrace(matrix[y_kmeans == 2, 0], matrix[y_kmeans == 2, 1], matrix[y_kmeans == 2, 2], s= 6, c='lightblue', label = '2: Irregular Customers') 




x=matrix[:,0]
y=matrix[:,1]
z=matrix[:,2]
showGraph("Three Customer Segmentations of DBScan", "Monetary", [min(x),max(x)], "Frequency", [min(y),max(y)], "Recency", [min(z)-1,max(z)], [t1,t2,t3,t4])
In [233]:
display(Image('https://i.imgur.com/FsVkTkb.png', width=950, unconfined=True))
In [234]:
# #combined clusters for all clustering methods (DBS, HAC, K-means).

data = matrix

cls = DBSCAN(eps=best_eps, min_samples=best_samples)
cls.fit(data)
dbs_labels = cls.labels_ 

cls = AgglomerativeClustering(n_clusters=3, linkage='ward')
cls.fit(data)
hac_labels = cls.labels_ 

cls = KMeans(n_clusters=3, random_state=1)
cls.fit(data)
kmn_labels = cls.labels_

fig = plt.figure(figsize=(12,8))
title = ['DBSCAN','HAC','KMEANS']

for i,l in enumerate([dbs_labels, hac_labels,kmn_labels]):
    
    plt.subplot(3,2,2*i+1)
    plt.scatter(data[:, 0], data[:, 2]+np.random.random(data[:, 1].shape)/2, c=l, cmap=plt.cm.rainbow, s=20, linewidths=0)
    plt.xlabel('Monetary'), plt.ylabel('Frequency')
    plt.grid()
    plt.title(title[i])
    
    plt.subplot(3,2,2*i+2)
    plt.scatter(data[:, 0], data[:, 1]+np.random.random(data[:, 1].shape)/2, c=l, cmap=plt.cm.rainbow, s=20, linewidths=0)
    plt.xlabel('Monetary'), plt.ylabel('Recency')
    plt.grid()
    plt.title(title[i])
    
    


plt.tight_layout()
plt.show()

Overall Ramifications Summary (with DBscan and HAC): These models allowed us to have the ability to systematically craft a highly targeted marketing campaigns for different customer groups. It aided in customer relationship building an customer conversion. Not just from a non-customer to a customer, but from one type to another. We can aid the sales team in strategically approaching the customer with the correct assumptions. Not to mention, decrease negative reactions from customer due to misidentified targeting. This model serves as a launch off for historical analysis and should not be used for prospects. We believe that all three models did similar clustering segmentation of customers due to the decreased variability of having 3 features. However, the HAC and DBScan actually identified customers with low or zero recency, moderate frequency, and moderate monetary, where as we didn’t see that with K-Means. We think HAC and DBScan both identified customers that can be targeted to purchase products, who otherwise haven't purchased anything recently.

Algorithm Summary

We used three different clustering algorithms to cluster the RFM data. We used the silhouette score to choose the best cluster size within each algorithm. First, we looked at the K-Means clustering algorithm. The silhouette score for 2 clusters is .422 and the score for 3 clusters was .416, a very small difference of .006. The scores for 4 clusters and above fell off after. When we did a visual inpsection of the charts of the clusters, the 3D charts showed good evidence for 3 clusters. The Hierarchical Agglomerative Clustering (HAC) algorithm silhouette score for 2 clusters is .746 and the score for 3 was .684. The dendogram also gave good evidence to support 3 clusters. Visual inpsection of the two charts for HAC there provided more evidence that 3 clusters describes our data very well. Lastly, we utilized the DB Scan algorithm. The best silhouette score using this algorithm was also 3 samples. These charts looked similar to the HAC charts. In each case the Monetary vs Recency charts clustered well going above 3 clusters, but when looking at these in conjunction with the Monetary vs. Frequency charts, three clusters appeared to group the data.

Noteworthy to mention is how the 3D chart for K-means and DB Scarn helped us understand how the data looked. This turned out be a natural fit as we were comparing three different features. Understanding the data from a 3D perspective helped us better understand what the two different 2D charts were showing us.

3D Clustering using Alpha Shapes

This was just an interesting experiment to see what happens to our customer segmentation if we use Alpha Shapes. According to MatLab, "The alpha shape of a set of points is a generalization of the convex hull and a subgraph of the Delaunay triangulation. That is, the convex hull is just one type of alpha shape, and the full family of alpha shapes can be derived from the Delaunay triangulation of a given point set."

In [235]:
#Coding Source: https://plot.ly/python/3d-point-clustering/#reference
#MatLab Source: https://www.mathworks.com/examples/matlab/mw/matlab-ex67295599-alpha-shapes      

import plotly.plotly as py
import pandas as pd
import plotly

df = test
df.head()      #segmentation.drop(['R_Quartile','F_Quartile','M_Quartile'

scatter = dict(
    mode = "markers",
    name = "y",
    type = "scatter3d",    
    x = df['Monetary'], y = df['Recency'], z = df['Frequency'],
    marker = dict( size=4, color="rgb(23, 190, 207)" )
)
clusters = dict(
    alphahull = 7,
    name = "y",
    opacity = 0.1,
    type = "mesh3d",    
    x = df['Monetary'], y = df['Recency'], z = df['Frequency']
)
layout = dict(
    title = '3d point clustering',
    scene = dict(
        xaxis = dict( zeroline=False ),
        yaxis = dict( zeroline=False ),
        zaxis = dict( zeroline=False ),
    )
)
figg = dict( data=[scatter, clusters], layout=layout )
plotly.offline.plot(figg, auto_open=True)

#The output is an html file. We add the screenshot below. 
Out[235]:
'file://C:\\Users\\mtool\\Documents\\Data Mining MSDS 7331\\Projects\\Git project\\labs\\MSDS7331_NorthCarolinaDataset\\Lab3\\temp-plot.html'
In [236]:
display(Image('https://i.imgur.com/XR2xuPa.jpg', width=900, unconfined=True))

Association Rule Mining

Create lists of different descriptions

First with association rules we have to create the rules. We are using a subset of the entire population datasets where the transactions are purchases. We are also using a copy of the original dataset including all transactions where the Customer ID was null. We are not relying on CustomerID for this analysis, rather we will be doing associations amongst the items purchased, so we would rather keep in these rows as they will help with our analysis.

Create lists of different descriptions

In [237]:
from apyori import apriori
In [238]:
marketing_data_with_ID = marketing_data_copy.copy()

marketing_data_with_ID.drop_duplicates(inplace = True)
In [239]:
# Create the second data set that retains all the Customer ID for the Association Rules
# Change the rows with missing  data in CustomerID column:
marketing_data.dropna(subset=['Description'], inplace=True)

country_map = {'Bahrain':999991, 'EIRE':999992, 'France':999993, 'Hong Kong':999994, 'Israel':999995, 
               'Portugal':999996, 'Spain':999997, 'Switzerland':999998, 'United Kingdom':999999, 'Unspecified':999990}

marketing_data_with_ID['CustomerID'] = np.where(marketing_data_with_ID.CustomerID.isnull(),marketing_data_with_ID.Country.map(country_map),
                                               marketing_data_with_ID.CustomerID)
In [240]:
for ind,el in enumerate(marketing_data_with_ID['Description']):
    if type(el) != str:
        marketing_data_with_ID.loc[ind,'Description']='Unknown'
    elif re.match('.*[a-zA-Z]+.*',el) is not None:
        pass
    else:
        marketing_data_with_ID.loc[ind,'Description']='Unknown'
        
# Create the TotalPrice feature
marketing_data_with_ID['TotalPrice'] = marketing_data_with_ID['Quantity']*marketing_data_with_ID['UnitPrice']
In [241]:
by_invoice_with_ID = marketing_data_with_ID.groupby(['InvoiceNo', 'InvoiceDate','Country','CustomerID'], as_index=False)[ 'UnitPrice','Description','Quantity','TotalPrice'].agg(lambda x: list(x))

for ind,el in enumerate(by_invoice_with_ID['Description']):
    if type(el[0]) != str:
        by_invoice_with_ID.drop(ind,inplace=True)

by_invoice_with_ID['TotalPrice'] = by_invoice_with_ID['TotalPrice'].apply(sum)
by_invoice_with_ID['QuantityTotal'] = by_invoice_with_ID['Quantity'].apply(sum)
by_invoice_with_ID['Description_as_string'] = by_invoice_with_ID['Description'].apply(lambda x: "-".join(sorted(x)))
In [242]:
by_invoice_with_ID['Transaction']=''

for index,row in by_invoice_with_ID.iterrows():
    if str(row['InvoiceNo']).startswith("C"):
        by_invoice_with_ID.loc[index,'Transaction'] = 'Cancel'
    elif str(row['InvoiceNo']).startswith("A"):
        by_invoice_with_ID.loc[index,'Transaction'] = 'Adjust'
    else:
        by_invoice_with_ID.loc[index,'Transaction'] = 'Purchase'

Plot of sales stats overtime

In [243]:
sales_data = by_invoice_with_ID[by_invoice_with_ID.Transaction=="Purchase"]
In [244]:
sales_data = sales_data.copy()
sales_data.loc[:,'Average_per_item_cost'] = sales_data.apply(lambda x: x['TotalPrice'] / sum(x['Quantity']),axis=1)
In [245]:
sales_data.head(5)
Out[245]:
InvoiceNo InvoiceDate Country CustomerID UnitPrice Description Quantity TotalPrice QuantityTotal Description_as_string Transaction Average_per_item_cost
0 536365 2010-12-01 08:26:00 United Kingdom 17850.0 [2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25] [WHITE HANGING HEART T-LIGHT HOLDER, WHITE MET... [6.0, 6.0, 8.0, 6.0, 6.0, 2.0, 6.0] 139.12 40.0 CREAM CUPID HEARTS COAT HANGER-GLASS STAR FROS... Purchase 3.478000
1 536366 2010-12-01 08:28:00 United Kingdom 17850.0 [1.85, 1.85] [HAND WARMER UNION JACK, HAND WARMER RED POLKA... [6.0, 6.0] 22.20 12.0 HAND WARMER RED POLKA DOT-HAND WARMER UNION JACK Purchase 1.850000
2 536367 2010-12-01 08:34:00 United Kingdom 13047.0 [1.69, 2.1, 2.1, 3.75, 1.65, 4.25, 4.95, 9.95,... [ASSORTED COLOUR BIRD ORNAMENT, POPPY'S PLAYHO... [32.0, 6.0, 6.0, 8.0, 6.0, 6.0, 3.0, 2.0, 3.0,... 278.73 83.0 ASSORTED COLOUR BIRD ORNAMENT-BOX OF 6 ASSORTE... Purchase 3.358193
3 536368 2010-12-01 08:34:00 United Kingdom 13047.0 [4.25, 4.95, 4.95, 4.95] [JAM MAKING SET WITH JARS, RED COAT RACK PARIS... [6.0, 3.0, 3.0, 3.0] 70.05 15.0 BLUE COAT RACK PARIS FASHION-JAM MAKING SET WI... Purchase 4.670000
4 536369 2010-12-01 08:35:00 United Kingdom 13047.0 [5.95] [BATH BUILDING BLOCK WORD] [3.0] 17.85 3.0 BATH BUILDING BLOCK WORD Purchase 5.950000
In [246]:
timeseries_sales_data = sales_data.drop(['CustomerID','Country','Description','Transaction'],axis=1)
In [247]:
times = pd.to_datetime(sales_data.InvoiceDate)

timeseries_sales_data_group = timeseries_sales_data.groupby(times.dt.date)
timeseries_sales_data_group_avg = timeseries_sales_data_group.median()
In [248]:
width = 5
lag1 = timeseries_sales_data_group_avg['TotalPrice'].shift(1)
lag3 = timeseries_sales_data_group_avg['TotalPrice'].shift(width - 1)
window = lag3.rolling(window=width)
smooth_TotalPrice = window.mean()
smooth_TotalPrice.name='smooth_TotalPrice'
smooth_TotalPrice = smooth_TotalPrice

lag1 = timeseries_sales_data_group_avg['QuantityTotal'].shift(1)
lag3 = timeseries_sales_data_group_avg['QuantityTotal'].shift(width - 1)
window = lag3.rolling(window=width)
smooth_QuantityTotal = window.mean()
smooth_QuantityTotal.name='smooth_QuantityTotal'
smooth_QuantityTotal = smooth_QuantityTotal


timeseries_sales_data_group_avg = pd.concat([timeseries_sales_data_group_avg, smooth_TotalPrice, smooth_QuantityTotal],axis=1)
In [249]:
fig = plt.figure(figsize=(16,10))
plt.plot(timeseries_sales_data_group_avg.index,timeseries_sales_data_group_avg.smooth_QuantityTotal, label='Average Quantity');
plt.plot(timeseries_sales_data_group_avg.index,timeseries_sales_data_group_avg.smooth_TotalPrice, label='Average Total Price')
plt.title('Average sales stats per day',size=20)
plt.legend()
plt.show()

Running Apriori for Association Rules

In [250]:
list_purchase_marketing_data_grouping_descriptions = []
for el in by_invoice_with_ID.loc[by_invoice_with_ID['Transaction']=='Purchase','Description']:
    if len(el)>1:
        list_purchase_marketing_data_grouping_descriptions.append(el)
In [251]:
purchase_rules = apriori(list_purchase_marketing_data_grouping_descriptions, min_support=0.005,min_confidence=.2, 
                         max_confidence=.95,min_lift=0.3,max_lift=100,min_length=2, max_length=3 )

purchase_results = list(purchase_rules)
In [252]:
Markdown("""
## Aproiri results

For the apriori algorithm we wanted to have a range of 2-3 transactions per relationship, meaning there would be 1-2 antecedents and 1 consequent.
We set the support low because we wanted to find as many relationships possible to use in our analysis. In the end you found {num_n} relationships that meet the minimum criteria.
""".format(num_n = len(purchase_results)))
Out[252]:

Aproiri results

For the apriori algorithm we wanted to have a range of 2-3 transactions per relationship, meaning there would be 1-2 antecedents and 1 consequent. We set the support low because we wanted to find as many relationships possible to use in our analysis. In the end you found 14122 relationships that meet the minimum criteria.

In [253]:
purchase_list = [list(x) for x in purchase_results]
antec = []
conseq = []
support = []
conf = []
lift = []


#Here we are creating a dataframe for each relationship from the apriori rule set
for elem in purchase_list:
    antec.append(str(elem[2][0][0]))
    conseq.append(str(elem[2][0][1]))
    support.append(elem[1])
    conf.append(elem[2][0][2])
    lift.append(elem[2][0][3])

    
purchase_df = pd.DataFrame(data={'antecedants':antec,'consequents':conseq,'support':support,
                                 'confidence':conf,'lift':lift})

#Here we are removing anything related to this data coming from frozensets
purchase_df['antecedants'] = purchase_df['antecedants'].apply(lambda x: x.replace(r"frozenset({'",""))
purchase_df['antecedants'] = purchase_df['antecedants'].apply(lambda x: x.replace(r'frozenset({"',""))
purchase_df['antecedants'] = purchase_df['antecedants'].apply(lambda x: x.replace(r'"})',''))
purchase_df['antecedants'] = purchase_df['antecedants'].apply(lambda x: x.replace(r"'})",''))
purchase_df['antecedants'] = purchase_df['antecedants'].apply(lambda x: x.replace(r"', '",' & '))
purchase_df['antecedants'] = purchase_df['antecedants'].apply(lambda x: x.replace(r'", "',' & '))
purchase_df['antecedants'] = purchase_df['antecedants'].apply(lambda x: x.strip())


purchase_df['consequents'] = purchase_df['consequents'].apply(lambda x: x.replace(r"frozenset({'",""))
purchase_df['consequents'] = purchase_df['consequents'].apply(lambda x: x.replace(r'frozenset({"',""))
purchase_df['consequents'] = purchase_df['consequents'].apply(lambda x: x.replace(r'"})',''))
purchase_df['consequents'] = purchase_df['consequents'].apply(lambda x: x.replace(r"'})",''))
purchase_df['consequents'] = purchase_df['consequents'].apply(lambda x: x.strip())

purchase_df = purchase_df.sort_values(by='support',ascending=False)
In [254]:
NBR_TOP_VALUES = 25

ants = []
cons = []
loop_round = 0
for ant in purchase_df['antecedants']:
    if ant in ants:
        pass
    elif loop_round >= NBR_TOP_VALUES:
        break
    else:
        ants.append(ant)
        loop_round+=1
loop_round = 0
for con in purchase_df['consequents']:
    if con in cons:
        pass
    elif loop_round >= NBR_TOP_VALUES:
        break
    else:
        cons.append(con)
        loop_round+=1

top_support_apriori_values = pd.DataFrame(index = ants, columns=cons, data = np.zeros([NBR_TOP_VALUES,NBR_TOP_VALUES]))

for index, rows in purchase_df.iterrows():
    if rows['antecedants'] in top_support_apriori_values.index and rows['consequents'] in top_support_apriori_values.columns:
        top_support_apriori_values.at[rows['antecedants'],rows['consequents']]=rows['support']
        
top_conf_apriori_values = pd.DataFrame(index = ants, columns=cons, data = np.zeros([NBR_TOP_VALUES,NBR_TOP_VALUES]))

for index, rows in purchase_df.iterrows():
    if rows['antecedants'] in top_conf_apriori_values.index and rows['consequents'] in top_conf_apriori_values.columns:
        top_conf_apriori_values.at[rows['antecedants'],rows['consequents']]=rows['confidence']
        
top_lift_apriori_values = pd.DataFrame(index = ants, columns=cons, data = np.zeros([NBR_TOP_VALUES,NBR_TOP_VALUES]))

for index, rows in purchase_df.iterrows():
    if rows['antecedants'] in top_lift_apriori_values.index and rows['consequents'] in top_lift_apriori_values.columns:
        top_lift_apriori_values.at[rows['antecedants'],rows['consequents']]=rows['lift']
In [255]:
ant = purchase_df.antecedants.iloc[0]
con = purchase_df.consequents.iloc[0]
sup = purchase_df.support.iloc[0]
lift = purchase_df.lift.iloc[0]
conf = purchase_df.confidence.iloc[0]
Markdown("""
# Association Rules Rundown
With Association rules (implemented with apriori in Python), you find associative relationships between the antecedent and consequents. 
It can be easy to think of the relationship as an if then statement, if you purchase **{ant}** then there is a certain likelihood you will buy **{con}**. 

There are a number of ways to measure the relationship between the antecedent and the consequent. First the is support, which states what is the proportion of the entire population of transactions that the pair show up in, for **{ant}** and **{con}** **{sup}%** of the entire transactions have that pair.

Next, there is confidence, which indicates that out of all of the transactions that have the antecedent, include the consequent. This means that **{conf}%** of the transactions with **{ant}** include **{con}**.

Last there is lift, which indicate how much more likely the consequent is to be included when the antecedent is on the transaction. Which means **{con}** is **{lift}** times more likely to be purchased when **{ant}** is purchased.  
""".format(ant = ant, con=con, lift=round(lift,1), sup = round(sup*100,2), conf = round(conf*100,2)))
Out[255]:

Association Rules Rundown

With Association rules (implemented with apriori in Python), you find associative relationships between the antecedent and consequents. It can be easy to think of the relationship as an if then statement, if you purchase JUMBO BAG PINK POLKADOT then there is a certain likelihood you will buy JUMBO BAG RED RETROSPOT.

There are a number of ways to measure the relationship between the antecedent and the consequent. First the is support, which states what is the proportion of the entire population of transactions that the pair show up in, for JUMBO BAG PINK POLKADOT and JUMBO BAG RED RETROSPOT 4.45% of the entire transactions have that pair.

Next, there is confidence, which indicates that out of all of the transactions that have the antecedent, include the consequent. This means that 67.44% of the transactions with JUMBO BAG PINK POLKADOT include JUMBO BAG RED RETROSPOT.

Last there is lift, which indicate how much more likely the consequent is to be included when the antecedent is on the transaction. Which means JUMBO BAG RED RETROSPOT is 6.0 times more likely to be purchased when JUMBO BAG PINK POLKADOT is purchased.

In [256]:
ax = sns.clustermap(top_conf_apriori_values, figsize=(16,10),cmap='RdBu');
ax.fig.suptitle('Clustermap of confidence of items with the most support',size  = 28)
plt.setp(ax.ax_heatmap.xaxis.get_majorticklabels(), rotation=30, ha='right');

for ind_i,i in enumerate(ax.ax_heatmap.xaxis.get_ticklabels()):
    for ind_j,j in enumerate(ax.ax_heatmap.yaxis.get_ticklabels()):
        
        
        #Removes text added to tick labels
        i_text = re.sub("Text\W(\d*\.\d|\d),\d.[\"\']",'',str(i))
        i_text = re.sub(r"[\'\"]\)$",'',i_text)
        
        j_text = re.sub("Text\W(\d*\.\d|\d),\d.[\"\']",'',str(j))
        j_text = re.sub(r"[\'\"]\)$",'',j_text)

        text = ax.ax_heatmap.text(ind_j, ind_i, int(top_conf_apriori_values.loc[j_text, i_text]*100),
                       ha="center", va="center", color="black",position=(ind_i+.5,ind_j+.5))

ax.ax_heatmap.set_xlabel('consequents', size = 20);
ax.ax_heatmap.set_ylabel('antecedants', size = 20);
In [257]:
ax = sns.clustermap(top_lift_apriori_values, figsize=(16,10),cmap='RdBu');
ax.fig.suptitle('Clustermap of lift of items with the most support',size  = 28)
plt.setp(ax.ax_heatmap.xaxis.get_majorticklabels(), rotation=30, ha='right');

for ind_i,i in enumerate(ax.ax_heatmap.xaxis.get_ticklabels()):
    for ind_j,j in enumerate(ax.ax_heatmap.yaxis.get_ticklabels()):
        
        
        #Removes text added to tick labels
        i_text = re.sub("Text\W(\d*\.\d|\d),\d.[\"\']",'',str(i))
        i_text = re.sub(r"[\'\"]\)$",'',i_text)
        
        j_text = re.sub("Text\W(\d*\.\d|\d),\d.[\"\']",'',str(j))
        j_text = re.sub(r"[\'\"]\)$",'',j_text)

        text = ax.ax_heatmap.text(ind_j, ind_i, int(top_lift_apriori_values.loc[j_text, i_text]),
                       ha="center", va="center", color="black",position=(ind_i+.5,ind_j+.5))

ax.ax_heatmap.set_xlabel('consequents', size = 20);
ax.ax_heatmap.set_ylabel('antecedants', size = 20);
In [258]:
ax = sns.clustermap(top_support_apriori_values, figsize=(16,10),cmap='RdBu');
ax.fig.suptitle('Clustermap of support of items with the most support',size  = 28)
plt.setp(ax.ax_heatmap.xaxis.get_majorticklabels(), rotation=30, ha='right');

for ind_i,i in enumerate(ax.ax_heatmap.xaxis.get_ticklabels()):
    for ind_j,j in enumerate(ax.ax_heatmap.yaxis.get_ticklabels()):
        
        
        #Removes text added to tick labels
        i_text = re.sub("Text\W(\d*\.\d|\d),\d.[\"\']",'',str(i))
        i_text = re.sub(r"[\'\"]\)$",'',i_text)
        
        j_text = re.sub("Text\W(\d*\.\d|\d),\d.[\"\']",'',str(j))
        j_text = re.sub(r"[\'\"]\)$",'',j_text)

        text = ax.ax_heatmap.text(ind_j, ind_i, round(top_support_apriori_values.loc[j_text, i_text]*100,1),
                       ha="center", va="center", color="black",position=(ind_i+.5,ind_j+.5))

ax.ax_heatmap.set_xlabel('consequents', size = 20);
ax.ax_heatmap.set_ylabel('antecedants', size = 20);

Association rule sorted by support: Interpretation

These antecedents and consequents were obtained by sorting off of support, meaning the items selected represented the most common relationships in the transactions. The items seem to be related to everyday consumer products like lunch bags, bags, alarm clocks, and teacup & saucer combos.

It is interesting to see the differences between what provide the greatest lift as opposed to the most popular items. The Jumbo Bag Pink Polkadot is most often purchased with the Jumbo Bag Retrospot at over 4% of the transactions; while the Green Regency Teacup and Saucer gives the biggest lift to the Pink Regency Teacup and Saucer.

The cluster map helps to show that there are certain groups that form from associations.

In [259]:
most_common_conf_apriori_values = pd.DataFrame(index = purchase_df['antecedants'].value_counts()[:NBR_TOP_VALUES].index, 
                                  columns=purchase_df['consequents'].value_counts()[:NBR_TOP_VALUES].index, 
                                  data = np.zeros([NBR_TOP_VALUES,NBR_TOP_VALUES]))

for index, rows in purchase_df.iterrows():
    if rows['antecedants'] in most_common_conf_apriori_values.index and rows['consequents'] in most_common_conf_apriori_values.columns:
        most_common_conf_apriori_values.at[rows['antecedants'],rows['consequents']]=rows['confidence']
        

most_common_lift_apriori_values = pd.DataFrame(index = purchase_df['antecedants'].value_counts()[:NBR_TOP_VALUES].index, 
                                  columns=purchase_df['consequents'].value_counts()[:NBR_TOP_VALUES].index, 
                                  data = np.zeros([NBR_TOP_VALUES,NBR_TOP_VALUES]))

for index, rows in purchase_df.iterrows():
    if rows['antecedants'] in most_common_lift_apriori_values.index and rows['consequents'] in most_common_lift_apriori_values.columns:
        most_common_lift_apriori_values.at[rows['antecedants'],rows['consequents']]=rows['lift']
        
        
most_common_support_apriori_values = pd.DataFrame(index = purchase_df['antecedants'].value_counts()[:NBR_TOP_VALUES].index, 
                                  columns=purchase_df['consequents'].value_counts()[:NBR_TOP_VALUES].index, 
                                  data = np.zeros([NBR_TOP_VALUES,NBR_TOP_VALUES]))

for index, rows in purchase_df.iterrows():
    if rows['antecedants'] in most_common_support_apriori_values.index and rows['consequents'] in most_common_support_apriori_values.columns:
        most_common_support_apriori_values.at[rows['antecedants'],rows['consequents']]=rows['support']
In [260]:
ax = sns.clustermap(most_common_conf_apriori_values, figsize=(16,10),cmap='RdBu');
ax.fig.suptitle('Clustermap of confidence of most common descriptions',size  = 28)
plt.setp(ax.ax_heatmap.xaxis.get_majorticklabels(), rotation=30, ha='right');

for ind_i,i in enumerate(ax.ax_heatmap.xaxis.get_ticklabels()):
    for ind_j,j in enumerate(ax.ax_heatmap.yaxis.get_ticklabels()):
        
        #Removes text added to tick labels
        i_text = re.sub("Text\W(\d*\.\d|\d),\d.[\"\']",'',str(i))
        i_text = re.sub(r"[\'\"]\)$",'',i_text)
        
        j_text = re.sub("Text\W(\d*\.\d|\d),\d.[\"\']",'',str(j))
        j_text = re.sub(r"[\'\"]\)$",'',j_text)

        
        text = ax.ax_heatmap.text(ind_j, ind_i, int(most_common_conf_apriori_values.loc[j_text, i_text]*100),
                       ha="center", va="center", color="black",position=(ind_i+.5,ind_j+.5))

ax.ax_heatmap.set_xlabel('consequents', size = 20);
ax.ax_heatmap.set_ylabel('antecedants', size = 20);
In [261]:
ax = sns.clustermap(most_common_lift_apriori_values, figsize=(16,10),cmap='RdBu');
ax.fig.suptitle('Clustermap of lift of most common descriptions',size  = 28)
plt.setp(ax.ax_heatmap.xaxis.get_majorticklabels(), rotation=30, ha='right');

for ind_i,i in enumerate(ax.ax_heatmap.xaxis.get_ticklabels()):
    for ind_j,j in enumerate(ax.ax_heatmap.yaxis.get_ticklabels()):
        
        #Removes text added to tick labels
        i_text = re.sub("Text\W(\d*\.\d|\d),\d.[\"\']",'',str(i))
        i_text = re.sub(r"[\'\"]\)$",'',i_text)
        
        j_text = re.sub("Text\W(\d*\.\d|\d),\d.[\"\']",'',str(j))
        j_text = re.sub(r"[\'\"]\)$",'',j_text)

        
        text = ax.ax_heatmap.text(ind_j, ind_i, int(most_common_lift_apriori_values.loc[j_text, i_text]),
                       ha="center", va="center", color="black",position=(ind_i+.5,ind_j+.5))

ax.ax_heatmap.set_xlabel('consequents', size = 20);
ax.ax_heatmap.set_ylabel('antecedants', size = 20);
In [262]:
ax = sns.clustermap(most_common_support_apriori_values, figsize=(16,10),cmap='RdBu');
ax.fig.suptitle('Clustermap of support of top common descriptions',size  = 28)
plt.setp(ax.ax_heatmap.xaxis.get_majorticklabels(), rotation=30, ha='right');

for ind_i,i in enumerate(ax.ax_heatmap.xaxis.get_ticklabels()):
    for ind_j,j in enumerate(ax.ax_heatmap.yaxis.get_ticklabels()):
        
        
        #Removes text added to tick labels
        i_text = re.sub("Text\W(\d*\.\d|\d),\d.[\"\']",'',str(i))
        i_text = re.sub(r"[\'\"]\)$",'',i_text)
        
        j_text = re.sub("Text\W(\d*\.\d|\d),\d.[\"\']",'',str(j))
        j_text = re.sub(r"[\'\"]\)$",'',j_text)

        text = ax.ax_heatmap.text(ind_j, ind_i, round(most_common_support_apriori_values.loc[j_text, i_text]*100,1),
                       ha="center", va="center", color="black",position=(ind_i+.5,ind_j+.5))

ax.ax_heatmap.set_xlabel('consequents', size = 20);
ax.ax_heatmap.set_ylabel('antecedants', size = 20);

With the graph above we are showing the items that show up most commonly in the antecedents and consequents. With this relationship we then display the relative confidence for each pair.

This graph is interesting because there are a couple of distinct sections with relative confidence. For this chart there seems to be a range of .30 to 1 for the confidence between the consequent and the antecedent. We see a lot of dotcom postage in the antecedent category. Looking at the timeseries graph of sales patterns over time you see there is increased sales due to increased units sold toward the end of the year, this could be a result of increased purchases for anticipated holiday sales.

Deployment

With increasing focus on customer engagement, management, and service, the models we’ve developed represents a growing sector in the marketing and business analytics space. For one, we found several software companies selling the RFM methodology encapsulated in a clustering or classification technique. Some of these companies include Putler, Optimove, Canopy Labs, and Clever Analytics. The needs and demand is definitely there.

Our performance considering the data set we have was above average. One improvement for having tighter cluster groups would be to apply our model to countries in isolation. As we illustrated in the Data Understanding section, geography played a role, directly or indirectly, on the frequency and volume of sales. Regardless, we wanted to test our model’s limits and it still generated numbers that can actually be used for segmentation of customers.

Going forward, we will not discuss hypothetical deployment by “other parties” or companies because the model we produced will actually be deployed on 8/14/2018 by Murtada Shubbar, a member of this team. Murtada works in the Pricing and Profitability department of a billion dollar plus distribution company, with over 80,000 customers. Surprisingly, the only customer segmentation that exists currently at that company is as follows:

In [264]:
display(Image('https://i.imgur.com/noJ1n6M.png', width=450, unconfined=True))
display(Image('https://i.imgur.com/RLe7IPQ.png', width=450, unconfined=True))

Customer Level represents a static customer assignment based on what Distribution Center (DC) managers think the customer brings into the company. Customer Industry represents the main focus of what the customer sells, although, most overlap in more than one area. From this combination, the department determines which customer will get which price point. In theory, because there are 5 different price points for each item sold, the customer that brings in the most sales will receive the most discounted price and so on.

Several problems arise with this methodology:

  1. The DC Manager is biased, and often to increase sales, they change customer level attributes to give them better pricing.

  2. The customer segmentation is static. The only way to reclassify all the customers is to make the DC manager go through them.

  3. The customer industry classification has no purpose in methodology of choosing which customer getting which price.

Thus, the weakness of the customer classification system that’s currently deployed at Murtada’s company is exposed. Only sales volume is used, hence only one of our variables: Monetary.

As we have indicated in the past, the model we’re using for customer segmentation is more dynamic, objective, and insightful. The model gives us a more holistic and historically more accurate picture of the customers. Furthermore, exploring the three dimensional nature of the customer, we’ve suggested researched action steps for the sales team. Actions such as implementing more aggressive pricing incentives or beneficially letting go of “lost customers.”

Murtada’s director plans to run a pilot with the new customer segmentations produced to test if they have increased conversion and revenue rates. This ultimately will also decrease marketing costs, since the department will not target low yielding customers.

Additional data is not needed for this model, however, an additional variable will help elevate the customer segmentation process to a new level. The new feature added to the RFM would be ‘V’, which stands for velocity. How fast is the customer growing? What’s their trajectory. This adds an additional dimension to the analysis: RFMV. The model will need to be updated twice a year, in the winter season and in the summer season. However, for other companies, they can updated it as much as they want.

Exceptional Work

Summary of Exceptional Work throughout the paper

$\textbf{3D Modeling}$ The additional 3D modeling demonstrates the relationship between the 3 primary metrics for customer behavior. The use of 3D modeling better shows the distinct clusters and how effective the different clustering techniques are.

$\textbf{Association Rules Modeling}$ The addition of association rule modeling shows the buying patterns associated with the items that are purchased. The three primary metrics associated with association rules (lift, support, and confidence) were mapped out using clustermaps.

$\textbf{Timeseries graphs}$ The purchasing patterns of time were displayed in a graph to show the changes in buying patterns over time. This helped reinforce some of the observations found in the types of items purchased for the association rule modeling.

$\textbf{Grouping of cancelled orders}$ We attempted to join the cancelled orders with the purchases but found a small share of the transactions matched. It was found that about 10% of the cancelled orders could be confidently matched up with a purchase invoice. We went off of description, price, invoice order (any purchases happening after a cancelled order were dropped) and quantity. With such a small matching rate we felt there was insufficient data to make any sort of proper analysis.

$\textbf{3D Modeling using Alpha Shapes}$ Experimental modeling technique that did not result in any additional insight.